Journal of Babylon University/Pure and Applied Sciences/ No.(4)/ Vol.(23): 2015
Using Data Mining Technique to Classify
Medical Data Set
Mohammed Abdullah Naser
Department of Computer, College of Sciences for Women, University of Babylon
Mohamed_1276@yahoo.com
Fryal Jassim Abd Al_Razaq
Software Department , College of Information Technology , University of Babylon
Fryal.jassim@yahoo.com
Abstract
Data mining is an iterative, multistage process for extracting useful knowledge. Real-life
data mining applications are interesting because they often present a different set of problems for data
miners; one such application, which we address here, concerns databases of diabetic patients. Medical data
mining is the process of extracting hidden patterns from medical data.
In this research, a data mining method using an enhanced RIPPER classification algorithm has
been designed and implemented. This algorithm is a rule-based classifier built on rule growing
and pruning. It is applied to the Pima Indians of National Institute of Diabetes Datasets and the Mirjan
hospital Diabetes Datasets in order to reach an acceptable classification accuracy
compared with other existing classification algorithms. The data set is loaded into the system and split into
patient and non-patient records based on the patient class, and then the initial rules are applied to obtain the
first classification percentage. Information gain is applied in each classification step as an indicator of a
rule's importance. In the final step, rules are grown using a unification method applied in each
step, where the highest-ranked rules are merged to obtain a new rule. The accuracy obtained was
low in the first growing process and increased during the rule-merging process. In the final
iteration, the accuracy obtained was higher than that of the standard existing algorithm.
Keywords: Data Mining Technique, Medical Data Set
Abstract (translated from the Arabic)
Data mining is an iterative, multistage process for extracting useful knowledge. Data mining applications
that use real-life data are of great importance because they often contain varied problems worth mining; one
such application is our work on the data of diabetic patients. Medical data mining is the process of
extracting hidden patterns from medical data.
In this research, a data mining method was applied by enhancing the RIPPER classification algorithm, which
was designed and implemented. This algorithm classifies by means of rules, through rule growing and
pruning. The enhanced classification algorithm was applied to the well-known PIMA diabetes data and to real
diabetes patient data from Mirjan hospital in order to reach an acceptable accuracy compared with the
classification rates of other algorithms. The initial data are loaded into the system and split into patient and
non-patient data, and then the initial rules are applied to obtain the first classification percentage; the extent
to which a rule applies to the data is used as an indicator of that rule's importance. In the final step, rules are
grown using the unification method applied in each step, where the highest-ranked rules are merged to obtain
a new rule. The results obtained were small in the initial rule-growing process, while the accuracy increased
during the rule-merging process. In the final stage of the enhanced algorithm we obtained an accuracy higher
than that of the existing standard algorithm.
Keywords: data mining, medical data.
1. Introduction
Data mining (DM) refers to the process of finding interesting patterns and
useful information in databases. The interesting patterns can be used to make
predictions [Fayy96]. The process of data mining is composed of several steps
including selecting data to analyze, preparing the data, applying the data mining
algorithms, and then interpreting and evaluating the results [Witt05].
The concept of data mining is well known in different areas such as business,
engineering, communications, transport, medicine and education [Mao11]. Data from
various fields of activity are produced and stored daily, then processed and transmitted to
different locations without taking their meaning into account. Managers focus
mainly on finding methods and techniques to organize the huge volumes of data provided by
transactions or other activities and to extract useful patterns, relations and associations
from the data [Andr99, Tudo09].
Classification is one of the most popular data mining techniques. It is
used to predict group membership for data instances, and
involves finding rules that partition the data into disjoint groups [Chri12, Geor04].
Repeated Incremental Pruning to Produce Error Reduction (RIPPER) is a
classification algorithm designed to generate a rule set directly from the training
dataset. The name is drawn from the fact that the rules are learned incrementally. A
new rule associated with a class value covers various attributes of that class. The
algorithm was designed to be fast and effective on large and noisy
datasets compared with decision trees [Duma11].
In this paper we propose several enhancements to the RIPPER data mining
algorithm and use it to classify two diabetes datasets according to the useful
patterns mined from them. The enhanced classification algorithm is applied to the two diabetes
datasets in order to reach an acceptable classification accuracy
compared with other existing classification algorithms.
2. Rule-based Classification Algorithms.
Rule-based classifiers classify data by using a collection of “if . . . then . . .”
rules. The rule antecedent, or condition, is an expression made of attribute
conjunctions. The rule consequent is a positive or negative classification. A rule r
covers a given instance x if the attributes of the instance satisfy the rule condition.
The coverage of a rule is the fraction of records that satisfy its antecedent. Its
accuracy, on the other hand, is the fraction of the covered records that also satisfy
the consequent. Finally, a classifier has exhaustive rules if they account for every
possible combination of attribute values, i.e. each record is covered by at least one
rule. RIPPER and CN2 are good examples of such algorithms. The advantages of
rule-based classifiers are that they are extremely expressive, since they are symbolic
and operate on the attributes of the data without any transformation. Rule-based
classifiers, and by extension decision trees, are easy to interpret, easy to generate, and
can classify new instances efficiently [Amat11].
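As a minimal sketch of the coverage and accuracy measures described in this section, the following Python fragment evaluates a single rule on a handful of records. Accuracy is computed here over the records the rule covers, which is the usual textbook reading. The record layout, the helper names (covers, coverage_and_accuracy) and the toy data are illustrative assumptions and are not taken from the paper's system.

```python
def covers(rule, features):
    """Return True when every attribute test in the rule matches the record."""
    return all(features.get(attr) == value for attr, value in rule.items())


def coverage_and_accuracy(rule, target_class, records):
    """Coverage: share of all records matched by the antecedent.
    Accuracy: share of the matched records whose label equals the consequent."""
    covered = [r for r in records if covers(rule, r["features"])]
    coverage = len(covered) / len(records)
    correct = sum(1 for r in covered if r["label"] == target_class)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy


# Toy records (hypothetical values, already discretized to low/medium/high).
records = [
    {"features": {"glucose": "high", "bmi": "high"}, "label": 2},
    {"features": {"glucose": "high", "bmi": "high"}, "label": 1},
    {"features": {"glucose": "low", "bmi": "high"}, "label": 1},
    {"features": {"glucose": "high", "bmi": "low"}, "label": 2},
]

# IF (glucose is high) & (bmi is high) THEN Class 2
rule = {"glucose": "high", "bmi": "high"}
coverage, accuracy = coverage_and_accuracy(rule, 2, records)
print(coverage, accuracy)
```

In this toy data the rule matches two of the four records and is right on one of them, so both measures come out at one half.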
3. RIPPER classification algorithm.
The name RIPPER is drawn from the fact that the rules are learned incrementally. A
new rule associated with a class value covers various attributes of that class. The
algorithm was designed to be fast and effective on large and noisy
datasets compared with decision trees [Duma11].
For two-class problems, RIPPER chooses the majority class as its default class
and learns the rules for detecting the minority class. For multiclass problems, the
classes are ordered according to their frequencies. The algorithm is shown in
figure (1):
Algorithm: RIPPER Algorithm.
Input: Training dataset S with n instances and m attributes
Output: Rule set
Begin
  Sort classes in order from the least prevalent class to the most prevalent class.
  Create a new rule set
  while iterating from the least prevalent class to the most prevalent class
    split S into Spos and Sneg
    while Spos is not empty
      split Spos and Sneg into Gpos and Gneg subsets and Ppos and Pneg subsets
      create and prune a new rule
      if the error rate of the new rule is very large then
        exit the inner while loop
      else
        add the new rule to the rule set
        compute the total description length l
        if l > d then
          exit the inner while loop
    end while
  end while
End
Figure (1): RIPPER Algorithm.
4. The proposed classification system.
The structure of the proposed system combines the stages of the data mining
process with the enhanced RIPPER algorithm. This integration results in
the system presented in figure (2).
[Block diagram; blocks in order: Medical Datasets, The Selected Database, Preprocessed Data Stage, Feature & Rules Selection Stage, Mining Process Stage, Model Evaluation, Part of Data Test, Classification Results]
Figure (2): Block diagram of the proposed classification system
4.1 Preprocessed Data Stage.
Two medical data sets are used in this study: the Pima Indian Diabetes Data Set
and the Mirjan Hospital Diabetes Data Set. The Pima data set
consists of 768 instances with no missing attribute values, and has 9
attributes plus the class attribute; all attributes are numeric.
The Mirjan Hospital data set consists of 160 instances with no missing
attribute values, and has 11 attributes plus the class attribute, which may be 1 or
2. These attributes are described in the following sections.
Data pre-processing begins with a data cleaning phase, which involves checking for
missing values in the data set, smoothing noisy data, identifying or removing outliers, and
removing inconsistencies. For our diabetes data sets, cleaning is not necessary
because the data are already clean. We also apply two transformation steps:
feature selection, which keeps (6) out of (9) attributes for the Pima data and (5) out of (11) attributes for the
Mirjan data for further processing by the enhanced RIPPER algorithm, and aggregation,
which is used for pruning and constructing new rules.
4.2 Features and Rules Selection Stage.
In the previous stage, visualization charts indicate how to choose the
features that are most suitable for the mining algorithm. According to the World
Health Organization (WHO), the most popular (6) features out of (9) are: (Glucose
concentration), (2-Hour serum insulin), (Body mass index), (Diabetes PF), (Age) and
(Pregnant). The most popular (5) features out of (11) selected for the Mirjan medical data
set are: (hbA1c), (waist), (TG), (HDL) and (microalbuminurea). The
rules are selected manually through trial and error during data analysis.
Table (1): The initial rules for Pima-Indians-Diabetes
Rule number | Condition rule | Action rule
Rule#1 | IF (Glucose concentration is high) & (2-Hour serum insulin is low) & (Body mass index is high) & (Diabetes PF is high) & (Age is low) | Class 2
Rule#2 | IF (Glucose concentration is medium) & (2-Hour serum insulin is low) & (Body mass index is medium) & (Diabetes PF is high) & (Age is high) | Class 2
Rule#3 | IF (Pregnant is high) & (Glucose concentration is medium) & (Body mass index is high) & (Diabetes PF is high) & (Age is high) | Class 2
Rule#4 | IF (Glucose concentration is high) & (2-Hour serum insulin is low) & (Body mass index is high) & (Diabetes PF is low) & (Age is medium) | Class 2
Rule#5 | IF (Glucose concentration is high) & (Body mass index is low) & (Age is medium) | Class 2
Rule#6 | IF (Glucose concentration is medium) & (Body mass index is high) & (Age is low) | Class 2
Rule#7 | IF (Pregnant is high) & (Glucose concentration is high) & (2-Hour serum insulin is low) & (Body mass index is high) & (Diabetes PF is high) & (Age is high) | Class 2
The low, high and medium value for each attribute is illustrated in the
following table.
Table (2): The low, high and medium values for each attribute of Pima-Indians-Diabetes
Feature name | Low | High | Medium
Number of times pregnant | 0 | 17 | 8
Plasma glucose concentration | 100 | 126 | 70
Diastolic blood pressure | 70 | >80 | 80
Triceps skin fold thickness | <24 | >30 | 24
2-Hour serum insulin | <100 | >=200 | [100-199]
Body mass index | <24.9 | >25 | [18.5 – 24.2]
Diabetes pedigree | <0.088 | >2.42000008 | 2.329
Age | <21 | >30 | [21-29]
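To make the role of these thresholds concrete, the following sketch maps a raw 2-Hour serum insulin reading to the low, medium or high level used in the rule conditions, using the boundaries listed for that attribute in table (2). The function name insulin_level is hypothetical, and treating readings between 199 and 200 as medium is an assumption, since the table only gives integer bounds.

```python
def insulin_level(value):
    """Discretize a 2-Hour serum insulin reading with the table (2) bounds:
    low is below 100, medium is 100 to 199, high is 200 and above."""
    if value < 100:
        return "low"
    if value < 200:  # assumption: non-integer readings up to 200 count as medium
        return "medium"
    return "high"


# Hypothetical readings, one per level.
for reading in (85, 150, 230):
    print(reading, insulin_level(reading))
```

A rule condition such as (2-Hour serum insulin is low) can then be tested by comparing the returned level with the level named in the rule.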
The same process was used with the Mirjan Hospital Diabetes Data set to
produce the initial rules and the low, high and medium values for each attribute.
4.3 Mining Processes Stage.
It consists of three steps:
1. Enhanced RIPPER Classification algorithm
The enhanced RIPPER algorithm is shown in figure (3).
Algorithm: Enhanced RIPPER Algorithm.
Input: Training dataset S with n instances and m attributes
Output: Rule sets
Begin
  Sort classes in order from the least prevalent class to the most prevalent class.
  While iterating over all frequencies of the initial rule sets
    Split S into Spos and Sneg
    While Spos is not empty
      Split Spos into Ppos and Pneg
      Create and prune a new rule
      If (the m's are equal in the two selected rules)
        Continue classification
      Else
        Select the unified attributes of the two specified rules,
        Return the high-rank attribute.
      Check the error rate (Information Gain) of each rule as an indicator.
      Add the new rule
    End While
  End While
  Return new rules
End.
Figure (3): Enhanced RIPPER algorithm.
The first step specifies the input and output parameters; the algorithm then iterates over all
rule sets for the whole data set. The data set (S) is split into positive and
negative subsets (Spos and Sneg). The next step describes how the rule pruning process and
the stop condition are built on these enhancements, as shown in the following.
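The algorithm uses information gain as an indicator for each rule, but the text does not spell out how it is computed. The sketch below therefore uses the standard entropy-based definition as an assumption: the gain of a rule is the drop in class entropy when the records are divided into those the rule covers and those it does not. The function names and the toy labels are illustrative only.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result


def information_gain(labels, covered_mask):
    """Entropy reduction obtained by splitting records into covered / not covered."""
    covered = [lab for lab, hit in zip(labels, covered_mask) if hit]
    uncovered = [lab for lab, hit in zip(labels, covered_mask) if not hit]
    n = len(labels)
    remainder = (len(covered) / n) * entropy(covered) \
        + (len(uncovered) / n) * entropy(uncovered)
    return entropy(labels) - remainder


# Toy labels: class 2 = patient, class 1 = non-patient (hypothetical sample).
labels = [2, 2, 1, 1]
perfect_rule = [True, True, False, False]   # covers exactly the patients
useless_rule = [True, False, True, False]   # covers one record of each class
print(information_gain(labels, perfect_rule))
print(information_gain(labels, useless_rule))
```

Under this definition a rule that separates the two classes cleanly receives the highest gain, and a rule whose covered records mirror the overall class mix receives none, which is one way to rank rules before merging them.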
2. Rule Pruning
This is the most important step of the enhanced RIPPER algorithm. Each rule’s
attributes are compared with those of the seven initial rules. In the aggregation
process, rules are aggregated according to the high-rank attributes extracted from the rule pairs above the
main diagonal of table (4). This process is performed because the initial rules differ in their
attribute values; the pruned rules differ as well. The pruned rules
achieve a high detection accuracy. The following table illustrates the
pruning process between each rule and the other rules for the Pima-Indians-Diabetes data
set.
Table (4): Pruning process of the enhanced RIPPER algorithm for the Pima-Indians-Diabetes data set.
Rule n,m | Rule# 1 | Rule# 2 | Rule# 3 | Rule# 4 | Rule# 5 | Rule# 6 | Rule# 7
Rule# 1 | R 1,1 | R 1,2 | R 1,3 | R 1,4 | R 1,5 | R 1,6 | R 1,7
Rule# 2 | R 2,1 | R 2,2 | R 2,3 | R 2,4 | R 2,5 | R 2,6 | R 2,7
Rule# 3 | R 3,1 | R 3,2 | R 3,3 | R 3,4 | R 3,5 | R 3,6 | R 3,7
Rule# 4 | R 4,1 | R 4,2 | R 4,3 | R 4,4 | R 4,5 | R 4,6 | R 4,7
Rule# 5 | R 5,1 | R 5,2 | R 5,3 | R 5,4 | R 5,5 | R 5,6 | R 5,7
Rule# 6 | R 6,1 | R 6,2 | R 6,3 | R 6,4 | R 6,5 | R 6,6 | R 6,7
Rule# 7 | R 7,1 | R 7,2 | R 7,3 | R 7,4 | R 7,5 | R 7,6 | R 7,7
The initial rules depend on six attributes (1, 2, 5, 6, 7 and 8): (Pregnant),
(Glucose concentration), (2-Hour serum insulin), (Body mass index), (Diabetes PF)
and (Age). For example, the aggregation process between rule #1 and rule #2 is:
Rule #1: IF (Glucose concentration is high) & (2-Hour serum insulin is low) & (Body
mass index is high) & (Diabetes PF is high) & (Age is low).
Rule #2: IF (Glucose concentration is medium) & (2-Hour serum insulin is low) &
(Body mass index is medium) & (Diabetes PF is high) & (Age is high).
Pruned rule: IF (Glucose concentration is medium) & (2-Hour serum insulin is low) &
(Body mass index is high) & (Diabetes PF is high) & (Age is low).
The pruning process in this step selects the best pruning and aggregation to
derive the new rules. The same processing is performed on the rules of the Mirjan
Hospital Diabetes Data set.
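The aggregation of rule #1 and rule #2 can be sketched as a merge over the attributes the two rules share: values on which they agree are kept, and values on which they differ are resolved by a tie-breaker. The paper does not define how the high rank of an attribute value is computed, so the tie-breaker here is a hypothetical callable (pick), and the choices passed to it are simply copied from the pruned rule in the example rather than derived. Dropping attributes that appear in only one rule is also an assumption.

```python
def merge_rules(rule_a, rule_b, pick):
    """Merge two rules over their shared attributes.

    pick(attr, value_a, value_b) is a hypothetical stand-in for the paper's
    "high rank" selection and is called only when the two values differ.
    """
    merged = {}
    for attr, value_a in rule_a.items():
        if attr not in rule_b:
            continue  # assumption: attributes missing from either rule are dropped
        value_b = rule_b[attr]
        merged[attr] = value_a if value_a == value_b else pick(attr, value_a, value_b)
    return merged


rule_1 = {"glucose": "high", "insulin": "low", "bmi": "high", "dpf": "high", "age": "low"}
rule_2 = {"glucose": "medium", "insulin": "low", "bmi": "medium", "dpf": "high", "age": "high"}

# Choices copied from the paper's pruned rule, not computed from any ranking.
paper_choice = {"glucose": "medium", "bmi": "high", "age": "low"}
pruned = merge_rules(rule_1, rule_2, lambda attr, a, b: paper_choice[attr])
print(pruned)
```

The two rules agree on insulin and Diabetes PF, so only glucose, body mass index and age need a decision; a real implementation would replace paper_choice with whatever ranking criterion the system actually uses.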
4.4 Model evaluation and classification results.
In this stage, we evaluate the proposed enhanced RIPPER algorithm by testing the
500 non-patient class records against the newly derived rules. The detection accuracy was
(75.37%) for the training stage in the pruning process. In the testing stage of the
evaluation model the accuracy was (94.4%), with a false negative rate of 5.6% (0.056) corresponding to
the remaining 28 records. Accuracy is calculated as:
Accuracy = (TP + TN) / (TP + TN + FN + FP)    (1)
TP: True Positive
TN: True Negative
FN: False Negative
FP: False Positive
Accuracy (Training) = (202 + 0) / (202 + 0 + 66 + 0) = 75.37%
Accuracy (Testing) = (472 + 0) / (472 + 0 + 28 + 0) = 94.4%
For the Mirjan data set, the detection accuracy was (88.33%) for the training
stage in the pruning process. In the testing stage of the evaluation model the accuracy was (100%),
with a 0.0% false negative rate on the remaining data, as follows:
Accuracy (Training) = (106 + 0) / (106 + 0 + 0 + 14) = 88.33%
Accuracy (Testing) = (40 + 0) / (40 + 0 + 0 + 0) = 100%
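The accuracy figures in this section can be checked with a few lines of Python. The sketch below implements equation (1) directly; the function name accuracy and the argument order (TP, TN, FN, FP, matching the order of the terms in the worked lines) are choices made for this illustration.

```python
def accuracy(tp, tn, fn, fp):
    """Equation (1): correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fn + fp)


# Counts taken from the worked examples in this section.
pima_training = accuracy(tp=202, tn=0, fn=66, fp=0)
pima_testing = accuracy(tp=472, tn=0, fn=28, fp=0)
mirjan_training = accuracy(tp=106, tn=0, fn=0, fp=14)
mirjan_testing = accuracy(tp=40, tn=0, fn=0, fp=0)

for name, value in [
    ("Pima training", pima_training),
    ("Pima testing", pima_testing),
    ("Mirjan training", mirjan_training),
    ("Mirjan testing", mirjan_testing),
]:
    print(f"{name}: {value * 100:.2f}%")
```

The Pima training value only reproduces 75.37% when the first term of the denominator is 202, the same count as in the numerator, giving 268 records in total.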
5. Performance comparison with other Data Mining Algorithms.
In this section the accuracy of the proposed algorithm is compared with that of several
algorithms applied to the Pima Indians Diabetes data set: RIPPER, JRIP, Belief Network (Laplace),
Decision Tree, Naïve Bayes, C4.5 and FURIA. Table (5) and
figure (4) show the classification accuracies on the testing data.
Table (5): A comparison of enhanced RIPPER with other algorithms that used the Pima-Indians-Diabetes Data Set.
Algorithm | Accuracy
Enhanced RIPPER | 94.4%
RIPPER | 74.56%
JRIP | 72.00%
Belief Network (Laplace) | 72.50%
Decision Tree | 72.00%
Naïve Bayes | 71.50%
C4.5 | 73.43%
FURIA | 74.71%
[Bar chart; vertical axis: Accuracy, 0% to 100%]
Figure (4): Accuracy of Enhanced RIPPER compared with other algorithms that
used the Pima-Indians-Diabetes Data Set.
6. Conclusions
Designing and implementing the enhanced RIPPER algorithm and applying it to
medical data sets led to several conclusions:
1. Rule-based classifiers such as RIPPER or the enhanced RIPPER are well suited to
handling data sets with unbalanced class distributions.
2. Concerning the benefits of the system:
a. The designed system facilitates the classification of medical data sets
into patient and non-patient records. Non-patient records are classified accurately in the testing
phase, and patient records are classified in the training phase.
b. The proposed system is designed to work fully on behalf of medical staff
and does not need user intervention.
c. Data mining classification methods offer many benefits in terms of
automation, simplicity and performance.
d. The proposed system can be applied to a medical data set with little time and
memory consumption.
3. Concerning the implementation part:
a. For the enhanced RIPPER algorithm, the resulting pruned rules are very
effective and sufficient. These rules were applied to the Mirjan medical data
and gave a high detection rate.
b. The performance was compared with other classification methods
such as decision trees, and the obtained results exceed those of the existing methods.
7. References
[Amat11] Xavier Amatriain, Alejandro Jaimes, Nuria Oliver, and Josep M. Pujol, Data
Mining Methods for Recommender Systems, book, 2011.
[Andr99] E. Andrássoyá and J. Paralič, Knowledge Discovery In Databases-A
Comparison Of Different Views, International Conference on Information
and Intelligent Systems, 1999.
[Chri12] Angeline Christobel. Y, Dr. P. SivaPrakasam, The Negative Impact of
Missing Value Imputation in Classification of Diabetes Dataset and
Solution for Improvement, IOSR Journal of Computer Engineering
(IOSRJCE), Volume 7, Issue 4, pp. 16-23, Nov. - Dec. 2012.
[Duma11] Mlungisi Duma, Bhekisipho Twala, Tshilidzi Marwala, Fulufhelo V.
Nelwamondo, Improving the Performance of the RIPPER
in Insurance Risk Classification: A Comparative
Study Using Feature Selection, 8th International Conference
on Informatics in Control, Automation and Robotics, 2011.
[Fayy96] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, From Data Mining
to Knowledge Discovery in Databases, AI Magazine, vol. 17, no. 3, pp. 37-54, 1996.
[Geor04] Arun George Eapen, Application of Data Mining in Medical
Applications, Master thesis, University of Waterloo, 2004.
[Folo03] S.O. Folorunso, A.B. Adeyemo, Alleviating Classification Problem of
Imbalanced Dataset, African Journal of Computing & ICT, Vol. 6, No.
2, June 2013.
[Hühn09] Jens Hühn and Eyke Hüllermeier, FURIA: An Algorithm For Unordered
Fuzzy Rule Induction, Springer US, Volume 19, Issue 3, pp. 293-319,
December 2009.
[Mao11] Yi Mao, Yixin Chen, Gregory Hackmann, Minmin Chen, Chenyang Lu,
Marin Kollef, and Thomas C. Bailey, Medical Data Mining for Early
Deterioration Warning in General Hospital Wards, Proc. ICDM
Workshop on Biological Data Mining and its Applications in Healthcare
(BioDM-11), 2011.
[Tudo09] Irina Tudor, Liviu Ionita, Intelligent Agents as Data Mining Techniques
Used in Academic Environment, The 4th International Conference on
Virtual Learning ICVL, 2009.
[Vija11] S. Vijayarani, M. Divya, An Efficient Algorithm for Generating
Classification Rules Hiding, International Journal of Computer Science
and Technology, Vol. 2, Issue 4, Oct. - Dec. 2011.
[Witt05] Ian H. Witten, Eibe Frank, Data Mining: Practical Machine Learning
Tools and Techniques, Elsevier, 2005.