Editors – Prof. Amos DAVID & Prof. Charles UWADIA
PREDICTION AND CLASSIFICATION CAPABILITIES OF DECISION TREE
ALGORITHMS IN MODELLING
ADEYEMO OMOWUNMI,
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF IBADAN, NIGERIA
ADEWOLE PHILLIP
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF LAGOS, NIGERIA
OGUNBIYI DOYINSOLA
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF LAGOS, NIGERIA
SAMSON ONI
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF IBADAN, NIGERIA
ABSTRACT
The decision tree is a data mining technique that can accurately classify data and make effective predictions. It has been
successfully employed for data analysis as a comprehensible knowledge representation in a broad range of fields, such as
customer relationship management, engineering, medicine, agriculture, computational biology, business management, and
fraudulent statement detection. In customer relationship management, findings discovered by decision tree models are
useful for understanding customers’ needs and preferences. In engineering, decision tree models have proved useful for
identifying the relationship between a household and its electricity consumption, and have achieved a high degree of
classification accuracy in fault detection. In medicine, the decision tree has proved a useful tool for discovering and
exploring hidden information in healthcare management, for cancer and heart disease prediction, for prediction of
out-of-hospital cardiac arrest, and for predicting the chances of occurrence of a disease in an area. In agriculture, it has
been used to predict soil fertility and to investigate the relationship between pasture production and environmental and
management factors. In this paper, we review research publications that have explored the accuracy of the prediction and
classification capabilities of decision trees in comparison with several other algorithms in different application domains.
Since data mining takes advantage of the large sets of data available for prediction, we also apply the decision tree, an
approach to achieving data mining, to records of heart disease patients gathered over the years.
KEYWORDS: Decision trees, Data mining, Heart disease, Classification and Prediction
1. INTRODUCTION
Data mining refers to the analysis of large sets of data, obtained as a result of activities that have taken
place over time, with the aim of revealing the hidden patterns (useful information) in the data. Data mining
identifies trends within data in a manner that goes beyond ordinary analysis (Olugbenga Oluwagbemi,
Uzoamaka Ofoezie, Nwinyi Obinna, 2012).
Data mining in healthcare is an emerging field of high importance for providing prognosis and a deeper
understanding of medical data. Data mining applications in healthcare include analysis of health care centres
for better health policy-making and prevention of hospital errors, early detection, prevention of diseases and
preventable hospital deaths, more value for money and cost savings, and detection of fraudulent insurance
claims (Ruben 2009).
Predicting the outcome of a disease is one of the most interesting and challenging tasks in which to develop
data mining applications.
Researchers are using data mining techniques in the medical diagnosis of several diseases such as diabetes
(Porter and Green 2009), stroke (Panzarasa, Quaglini et al. 2010), cancer (Li L 2004), and heart disease
(Das, Turkoglu et al. 2009).
Kangwanariyakul et al. (2010) and Patil and Kumaraswamy (2009) have applied data mining techniques
to the diagnosis of heart disease. Different classification methods such as Neural Networks and Decision
Trees were applied to predict the presence of heart disease and to identify the most significant factors
contributing to the disease, while association rule discovery was used to identify the effect of
diet, lifestyle, and environment on the outcome of the disease.
The decision tree is a data mining technique that can be used for classification and prediction. It is
used widely because the knowledge discovered from it is illustrated in a hierarchical structure, which
makes it easily understood by people who are not experts in data mining.
It is a predictive modelling technique developed by Ross Quinlan. It is a sequential classifier
in the form of a recursive tree structure. There are three kinds of nodes in a decision tree. The node
from which the tree is directed, and which has no incoming edge, is called the root node. A node with
outgoing edges is called an internal or test node. All other nodes are called leaves (also known as
terminal or decision nodes). The dataset is analysed by developing a branch-like structure with an
appropriate decision tree algorithm. Each internal node of the tree splits into branches based on the
splitting criteria; each test node denotes a test on an attribute, and each terminal node represents a
decision. Decision trees can work on both continuous and categorical attributes (Manpreet Singh et al.,
2013).
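As an illustration of this structure, the sketch below (hypothetical code, not taken from the cited work) represents the three node kinds and classifies an example by following branches from the root to a terminal node:

```python
# Minimal sketch of a decision tree's node structure: a root/internal (test)
# node holds a splitting attribute and one branch per attribute value; a
# leaf (terminal/decision node) holds the class decision.
class Node:
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # test attribute; None for a leaf
        self.branches = {}           # attribute value -> child Node
        self.label = label           # class decision; set only on leaves

    def is_leaf(self):
        return self.attribute is None

    def classify(self, example):
        """Follow the branch matching the example's value for each test
        attribute until a terminal (decision) node is reached."""
        if self.is_leaf():
            return self.label
        return self.branches[example[self.attribute]].classify(example)

# Tiny hand-built tree: the root tests 'outlook'; the leaves hold decisions.
root = Node(attribute="outlook")
root.branches["sunny"] = Node(label="no")
root.branches["overcast"] = Node(label="yes")

print(root.classify({"outlook": "overcast"}))  # -> yes
```

Reading any root-to-leaf path of such a tree gives the kind of 'if-then' rule described in the next section.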
1.2 PROCESSES OF DEVELOPING A DECISION TREE MODEL
A common way to create a decision tree model is to employ a top-down, recursive, divide-and-conquer
approach. Such a modelling approach places the most significant attribute at
the top level as the root node and the least significant attributes at the bottom level as
leaf nodes. Each path between the root node and a leaf node can be interpreted as an ‘if-then’
rule, which can be used for making predictions.
The creation of a decision tree model can be divided into three stages, which are
explained below (Mutasem Sh. Alkhasawneh et al., 2012).
1.2.1 TREE GROWING
The initial stage of creating a decision tree model is tree growing, which includes two steps: tree
merging and tree splitting. At the beginning, the non-significant predictor categories and the
significant categories within a dataset are grouped together (tree merging). As the tree grows,
impurities within the model increase. Since the existence of impurities may reduce
the accuracy of the model, there is a need to purify the tree. One possible way to do this is to separate
the impurities into different leaves and ramifications (tree splitting) (Mutasem Sh. Alkhasawneh
et al., 2012).
1.2.2 TREE PRUNING
Tree pruning, the key element of the second stage, removes irrelevant splitting nodes.
The removal of irrelevant nodes helps reduce the chance of creating an over-fitted tree. This
procedure is particularly useful because an over-fitted tree model may misclassify data
in real-world applications (Mutasem Sh. Alkhasawneh et al., 2012).
1.2.3 TREE SELECTION
The final stage of developing a decision tree model is tree selection. At this stage, the created
decision tree model is evaluated using either cross-validation or a testing dataset. This stage
is essential because it reduces the chance of misclassifying data in real-world applications and,
consequently, minimises the cost of developing further applications (Mutasem Sh. Alkhasawneh
et al., 2012).
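The three stages can be sketched with scikit-learn, used here purely as an illustration (the CHAID-style merging step is algorithm-specific and is not reproduced): growing a tree top-down, pruning it via the cost-complexity parameter `ccp_alpha`, and selecting the candidate with the best cross-validated accuracy.

```python
# Illustrative three-stage workflow: grow, prune, select.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stage 1 (tree growing): each candidate tree is grown top-down.
# Stage 2 (tree pruning): a larger ccp_alpha removes weaker splits.
# Stage 3 (tree selection): keep the candidate with the best
# cross-validated accuracy rather than the best training accuracy.
best_alpha, best_score = None, -1.0
for alpha in [0.0, 0.001, 0.01, 0.05]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"selected ccp_alpha={best_alpha}, CV accuracy={best_score:.3f}")
```

Selecting by cross-validated rather than training accuracy is what guards against the over-fitting that the pruning stage is designed to prevent.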
1.3 Problem Statement
WHO (2011) reported that cardiovascular diseases (CVDs) are the number one cause of death globally: more
people die annually from CVDs than from any other cause. An estimated 17.1 million people died from
CVDs in 2004, representing 29% of all global deaths. Of these deaths, an estimated 7.2 million were due to
coronary heart disease, one of the most common types of heart disease, and 5.7 million were due to
stroke.
Cardiovascular disease (CVD) is the number one cause of death globally, claiming 17.3 million lives each
year, with Nigeria having its fair share. CVD is the number one killer disease
anywhere in the world, including the developing countries. In particular, hypertension is the number one
cardiovascular disease in Nigeria and in central, southern, and West Africa.
Another WHO report shows that cardiovascular disease and ischaemic heart disease together account for 6%
of total deaths in Nigeria for all ages, which makes them the 7th and 8th deadliest diseases in Nigeria. The
number of persons dying from heart disease is expected to grow drastically, partly as a result of increasing
longevity, urbanization, and changes in lifestyle, work culture, and food habits (WHO, 2006).
In order to decrease mortality from heart disease, there should be a fast and effective detection method,
especially in developing countries like Nigeria, where there is a shortage of specialists and the rate of
wrongly diagnosed cases is high. Data mining can be a convenient tool to assist physicians in detecting the
disease by obtaining knowledge and information about the disease from patients’ data.
Data mining has shown promising results in the prediction of heart disease. It is widely applied for the
prediction or classification of different types of heart disease. For example, different data mining techniques
have been applied to the prediction of ischaemic heart disease and the diagnosis of coronary artery disease
(Tsipouras and Fotiadis, 2008; Kangwanariyakul et al., 2010). These successful studies conducted abroad
have motivated this study to tackle the underlying problem related to heart disease diagnosis that exists in
our country.
The purpose of this study is, therefore, to apply data mining techniques to extract hidden patterns
significant to heart disease from data collected at University College Hospital Ibadan, Nigeria.
1.4 RESEARCH OBJECTIVES
Heart disease has claimed many lives in Nigeria, in Africa, and in the world at large. Although
there is a standard treatment for it, that treatment is very expensive and delicate, so it is important to adopt
fast and reliable means of predicting or detecting the disease so that it can be controlled.
With the use of a decision-making system that implements the decision tree (whose predictive capability in
heart disease prediction and some other domains is critically reviewed in this paper), heart disease could be
eradicated or reduced to a very minimal level in Nigeria.
2. DECISION TREE ALGORITHMS
The main decision tree algorithms are ID3, C4.5, C5.0, CHAID, and CART.
Decision tree algorithms differ in the following ways:
- Capability of modelling different types of data, e.g. categorical data, numerical data, or a
combination of both (all of the above-mentioned algorithms support the modelling of
categorical data, whilst only the C4.5 and CART algorithms can be used for the
modelling of numerical data).
- The process of model development, especially at the stages of tree growing and tree pruning (the
ID3 and C4.5 algorithms split a tree model into as many ramifications as necessary, whereas the
CART algorithm supports only binary splits; the pruning mechanisms of the C4.5
and CART algorithms support the removal of insignificant nodes and ramifications, while the CHAID
algorithm instead halts the tree-growing process before the training data is overused).
2.1 ID3 (ITERATIVE DICHOTOMISER)
ID3 is a greedy decision tree learning algorithm introduced by Ross Quinlan in 1986 (Quinlan,
1986). It is based on Hunt's algorithm. The algorithm recursively selects the best attribute as the
current node using top-down induction, then generates the child nodes for the selected
attribute. It uses information gain, an entropy-based measure, to select the best splitting attribute:
the attribute with the highest information gain is selected. This algorithm does not maintain its
accuracy when there is too much noise in the training datasets. Its main disadvantages are that it
accepts only categorical attributes and that only one attribute is tested at a time when making a
decision. The concept of pruning is not present in the ID3 algorithm (Manpreet Singh et al., 2013).
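ID3's splitting criterion can be made concrete with a short sketch (illustrative code, not Quinlan's implementation): entropy of the class labels, and the information gain of a categorical attribute.

```python
# Sketch of ID3's splitting criterion: entropy of the class labels and the
# information gain of a categorical attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, labels):
    """Gain = entropy(parent) - weighted entropy of the partitions induced
    by the attribute; ID3 splits on the attribute maximising this value."""
    n = len(examples)
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[attribute], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Toy weather data: 'windy' separates the classes perfectly, 'humid' not at all.
data = [{"windy": "y", "humid": "y"}, {"windy": "y", "humid": "n"},
        {"windy": "n", "humid": "y"}, {"windy": "n", "humid": "n"}]
play = ["no", "no", "yes", "yes"]
print(information_gain(data, "windy", play))  # -> 1.0
print(information_gain(data, "humid", play))  # -> 0.0
```

ID3 would therefore choose `windy` as the splitting attribute at this node.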
2.2 C4.5 ALGORITHM
C4.5 is an extension of the ID3 algorithm, also developed by Ross Quinlan (Anyanwu & Shiva,
2009). This algorithm overcomes ID3's tendency to overfit through a process of
pruning. It handles both categorical and continuous-valued attributes. Gain ratio (normalised
information gain) is used to evaluate the best split: the attribute with the highest gain ratio is
chosen as the split attribute. Pessimistic pruning is used to remove unnecessary branches in the
decision tree (Manpreet Singh et al., 2013).
2.3 C5.0 ALGORITHM
The C5.0 algorithm is an improvement over C4.5 (Ruggieri, 2002). It uses the
concept of maximum gain to find the best attribute. It can produce classifiers
represented as rule sets or as decision trees. C5.0 rule sets are small in size and thus
not highly prone to error; the algorithm is much faster than C4.5 and produces simpler,
smaller decision trees. It also adopts a boosting technique, which creates and
combines multiple classifiers to generate improved accuracy. Another characteristic is that it
supports sampling and cross-validation (Manpreet Singh et al., 2013).
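C5.0's own boosting procedure is proprietary, but AdaBoost with shallow trees, shown below as a stand-in (an assumption of this sketch, not C5.0 itself), illustrates the same idea of combining multiple classifiers to improve accuracy:

```python
# Boosting illustrated with AdaBoost over decision stumps (depth-1 trees),
# compared against a single stump on the same data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single weak learner: one split over one attribute.
single = cross_val_score(DecisionTreeClassifier(max_depth=1, random_state=0),
                         X, y, cv=5).mean()

# Boosting: 50 stumps, each trained to focus on the previous ones' mistakes,
# combined by weighted vote.
boosted = cross_val_score(
    AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, random_state=0),
                       n_estimators=50, random_state=0),
    X, y, cv=5).mean()

print(f"single stump: {single:.3f}, boosted: {boosted:.3f}")
```

The boosted ensemble typically outperforms any single weak tree, which is the property C5.0 exploits.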
3. DECISION TREE CLASSIFICATION AND PREDICTION
A decision tree is a simple representation for classifying examples. Decision tree learning is one of
the most successful techniques for supervised classification. Assume that the features from which we
are building the decision tree have finite discrete domains, and that there is a single target feature called
the classification. Each element of the domain of the classification is called a class. A decision tree
or classification tree is a tree in which each internal (non-leaf) node is labelled with an input
feature. The arcs coming from a node labelled with a feature are labelled with each of the possible
values of that feature. Each leaf of the tree is labelled with a class or with a probability distribution over
the classes (Wikipedia, 2014).
Each interior node of a decision tree corresponds to one of the input variables; there are edges to
children for each of the possible values of that input variable. Each leaf represents a value of the
target variable given the values of the input variables represented by the path from the root to the
leaf (Wikipedia, 2014).
A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This
process is repeated on each derived subset in a recursive manner called recursive partitioning. The
recursion is completed when the subset at a node has all the same value of the target variable, or
when splitting no longer adds value to the predictions. This process of top-down induction of
decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most common
strategy for learning decision trees from data. Wikipedia (2014)
3.1 ALGORITHM FOR DECISION TREE INDUCTION
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left (Jiawei Han, 2006)
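The steps above can be sketched as a minimal, illustrative induction procedure (categorical attributes, information gain as the heuristic; the function and variable names are this sketch's own):

```python
# Minimal top-down induction of decision trees (TDIDT): greedy, recursive,
# divide-and-conquer, with the stopping conditions listed above.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(examples, labels, attributes):
    # Stop 1: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stops 2 and 3: no attributes or samples left -> majority vote.
    if not attributes or not examples:
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        parts = {}
        for ex, lab in zip(examples, labels):
            parts.setdefault(ex[attr], []).append(lab)
        return entropy(labels) - sum(
            len(p) / len(labels) * entropy(p) for p in parts.values())

    best = max(attributes, key=gain)  # heuristic test-attribute selection
    tree = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in {ex[best] for ex in examples}:  # partition recursively
        subset = [(ex, lab) for ex, lab in zip(examples, labels)
                  if ex[best] == value]
        tree[best][value] = build_tree([ex for ex, _ in subset],
                                       [lab for _, lab in subset], rest)
    return tree

data = [{"outlook": "sunny", "windy": "y"}, {"outlook": "sunny", "windy": "n"},
        {"outlook": "rain", "windy": "y"}, {"outlook": "rain", "windy": "n"}]
play = ["no", "no", "yes", "yes"]
print(build_tree(data, play, ["outlook", "windy"]))
```

On this toy data the procedure splits once on `outlook` and stops, because each branch then contains a single class.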
In a decision tree for the concept PlayTennis, an example is classified by sorting it through the
tree to the appropriate leaf node, then returning the classification associated with that leaf (in
this case, Yes or No).
3.3 DECISION TREE APPLICATIONS
Decision tree has been used to develop models for prediction and classification in different domains
some of which are Business Management, Customer Relationship Management, Fraudulent
Statement Detection Engineering Energy Consumption, Fault Diagnosis, Healthcare Management
,Agriculture as explained in the studies below.
3.3.1. CLASSIFICATION
D. Senthil Kumar et al., in their research, focused on medical diagnosis by learning patterns
from collected data on diabetes, hepatitis, and heart diseases, with the goal of developing intelligent
medical decision support systems to help physicians. They proposed using the C4.5, ID3, and CART
decision tree algorithms to classify these diseases and compared their effectiveness and correction
rates.
Mohd Najwadi Yusoff and Aman Jantan (2011) proposed the use of a Genetic Algorithm (GA)
to optimize Decision Trees (DT) in malware classification. Comparing against current techniques in
malware classification, they found that existing techniques do not give good classification results
when dealing with new and unique types of malware, which are highly specialized and very difficult
to classify. GA was chosen because unique types of malware essentially behave like the crossover
and permutation operations in GA. A new classifier was developed by combining GA with DT and
named the Anti-Malware System (AMS) Classifier. Two experiments were conducted to validate
their approach: the first used the DT Classifier and the second used the proposed AMS Classifier.
Both experiments were tested using 200 sample PE files with four chosen threshold values.
Experimental results from the AMS Classifier and DT were compared and visualized in tables and
graphs. Their results show that the AMS Classifier improves accuracy over the DT Classifier by
4.5% to 6.5%. The outcome of their paper is a new Anti-Malware Classification System (AMCS),
which consists of the AMS Classifier and new malware classes named Class Target Operation
(CTO); malware is classified using CTO, based mainly on the malware's target and its operational
behaviour.
Abolfazl Kazemi et al. (2011) researched the use of the CHAID, CRT, QUEST, and C5.0
decision tree algorithms to help organizations determine the criteria needed to identify
potential customers in the competitive environment of their business. A mechanism for identifying
potential customers liable to become real customers was also provided by combining Customer
Relationship Management (CRM) field results with data mining results. The main criteria were
identified and their importance determined; then, assuming that each main criterion consists of
several sub-criteria, the importance of the sub-criteria in turning potential customers into real ones
was determined in turn. According to their investigation, the decision tree appeared to be a proper
tool for identifying and classifying the factors that turn potential customers into real ones. The four
variables of product introduction, product type, sales expert, and product request were most effective
in turning potential customers into real ones, and the more criteria used in creating the decision tree,
the more easily customers were identified. The tree based on the C5.0 algorithm provided the optimal
variables and decision tree, with 83.96% accuracy, which was closest to the field results used for
comparison and performed best in practice.
Baisen Zhang and Russ Tillman (2007) investigated the potential of a decision tree approach for
modelling NFUE in New Zealand pastures. Their decision tree model suggested that the time of
applying N fertilizer was the most important factor influencing NFUE, with August or September
(early spring in New Zealand) being the best time of application. The interaction of rainfall and
temperature, rainfall, phosphorus (P) fertilizer history, soil Olsen P, and slope were other important
factors influencing NFUE. The researchers validated their models for 11 of the 16 trials tested, with
a predictive accuracy of 69%. The mechanisms by which these factors influence NFUE, and the
uncertainty associated with the model prediction, were discussed. It was concluded that this type of
modelling approach can be used to predict NFUE and thereby to assist decisions on when and where
to apply N fertilizer in pastures, increasing productivity while reducing the environmental
impact.
Abishek Suresh et al. investigated the application of decision tree models to the formation of
protein homodimer complexes in molecular catalysis and regulation. According to
the researchers, homodimer formation through the 2S (2-state), 3SMI (3-state with monomer
intermediate), and 3SDI (3-state with dimer intermediate) folding mechanisms is known for 47
homodimer structures. The dataset of forty-seven homodimers consists of twenty-eight 2S,
twelve 3SMI, and seven 3SDI, characterized using monomer length, interface area, and
interface/total (I/T) residue ratio. They found that 2S homodimers are often small in size with a
large I/T ratio and 3SDI are frequently large in size with a small I/T ratio, while 3SMI have a
mixture of these features. Hence, these parameters were used to develop a decision tree model,
which produced positive predictive values (PPV) of 72% for 2S, 58% for 3SMI, and 57% for 3SDI
in cross-validation. It was concluded that the method finds application in assigning folding
mechanisms to homodimers.
Mahjoobi (2007) studied the performance of decision tree classification for the prediction of wave
parameters, which are necessary for many applications in coastal and offshore engineering. The
dataset used in the study comprises wave data and over-water wind data gathered from a deep-water
location in Lake Ontario, divided into two groups. The first, comprising 26 days of wind and wave
measurements, was used as training and checking data to develop tree models; the second,
comprising 14 days of measurements, was used to verify the models. Training and testing data
include wind speed, wind direction, fetch length, and wind duration as input variables, and
significant wave height as the output variable. The wave heights for the whole dataset were grouped
into bins of 0.25 m, and a class was assigned to each bin. For evaluation of the developed model, the
mean of each class was compared with the observed data. Although several prediction models have
been proposed in the literature for this purpose, the decision tree models were found to give better
accuracy.
Wang Wei (2012) used a decision tree to classify a Landsat ETM+ image of Huainan city in Anhui,
established from analysis of the spectral characteristics, the texture characteristics, and other
auxiliary information such as NDVI, NDBI, and topographic characteristics. He also compared
decision tree classification with the maximum likelihood classification method. The results of the
study indicated that the accuracy of decision tree classification was 4.06% higher than that of
maximum likelihood classification, and the Kappa coefficient increased by 5.61%. These results
show that decision tree classification is flexible and can improve classification accuracy efficiently.
Kuldeep Kumar et al. (2006) discussed the effectiveness of using decision trees for
mass classification in mammography. The decision tree algorithms implemented by CART
(Classification and Regression Trees) and See5 were used for the experiments, with different costs
applied for type I and type II misclassification. The results obtained using decision tree algorithms
were compared with those produced by a neural network, which had been reported to give a higher
classification rate than statistical models, but with a higher standard deviation. It was concluded
that decision trees are very promising for the classification of breast masses in digital mammograms.
Michael D. Twa (2011) described the application of decision tree induction, an automated machine
learning classification method, to discriminate between normal and keratoconic corneal shapes in an
objective and quantitative way, with the aim of addressing the challenge of interpreting the volume
and complexity of data produced during videokeratography examinations. The proposed method was
compared with other known classification methods; the decision tree classifier performed equal to or
better than the other classifiers tested, with 92% accuracy and an area under the ROC curve of 0.97,
and it reduced the information needed to distinguish between normal and keratoconus eyes to four of
36 Zernike polynomial coefficients. The four surface features selected as classification attributes by
the decision tree method were inferior elevation, greater sagittal depth, oblique toricity, and trefoil.
It was concluded that automated decision tree classification of corneal shape through Zernike
polynomials is an accurate, quantitative, and interpretable method of classification that can be
generated from any instrument platform capable of raw elevation data output.
Gregor Stiglic et al. (2012), in their research, presented an extension to an existing machine
learning environment and a study on visual tuning of decision tree classifiers. The motivation for
their research came from the need to build effective and easily interpretable decision tree models
with a so-called one-button data mining approach, where no parameter tuning is needed. Bias in
classification was avoided by not using any classification performance measure during the tuning of
the model, which is constrained exclusively by the dimensions of the produced decision tree. The
proposed visual tuning was evaluated on 40 datasets containing classical machine learning problems
and 31 datasets from the field of bioinformatics. The results demonstrated a significant increase in
accuracy with less complex, visually tuned decision trees; in contrast to classical machine learning
benchmark datasets, higher accuracy gains were observed on the bioinformatics datasets.
Additionally, a user study confirmed that tree tuning times are significantly lower for the proposed
method than for manual tuning of the decision tree.
Peng Du and Ding Xiaoqing (2008) presented a method based on a decision tree classifier to
identify the gender of a person. Considering that gender features may be related to ethnicity
features, the tree was designed to decide ethnicity first and then decide gender within the
corresponding ethnicity. The method was implemented in a system consisting of three main parts:
face detection, feature extraction, and gender classification with a decision tree. A comparative
study of the tree classifier against an ordinary classifier without a decision tree was also reported
on a dataset of 1,928 face images. Their results show that the performance of the decision tree
classifier is superior to that of the ordinary classifier.
Felipe Lira (2013) and colleagues developed a decision tree model that indicated the action
range of peptides, i.e. the types of microorganisms on which they can exert biological activity, in
order to assist recent attempts to combat infections by identifying natural antimicrobial peptides
that circumvent resistance to commercial antibiotics. Their study described the development of
synthetic peptides with antimicrobial activity, created in silico by site-directed mutation modelling
using wild-type peptides as scaffolds for these mutations. Fragments of antimicrobial peptides were
used for modelling with molecular modelling computational tools. Their decision tree model was
built using physicochemical properties of known antimicrobial peptides available in the
Antimicrobial Peptide Database (APD). The results of their study showed that using decision trees
to evaluate the antimicrobial activity of synthetic peptides enables the creation of more effective
models for the development of new drugs, using known peptides as scaffolds for designing new
compounds, and reduces the cost and time required for research.
3.3.2. PREDICTION
Jay Gholap (2013) used attribute selection and boosting meta-techniques to tune the performance of
the J48 decision tree algorithm on the large amounts of data harvested along with crops, for
predicting the soil fertility class, since achieving and maintaining appropriate levels of soil fertility
is of utmost importance if agricultural land is to remain capable of nourishing crop production. His
research explained the steps for building a predictive model of soil fertility. The results show that of
the several decision tree algorithms used for this prediction, J48 gives 91.90% accuracy and hence
can be used as a base learner. He also discovered that with the help of meta-algorithms such as
attribute selection and boosting, J48 gives an accuracy of 96.73%, which makes it a good predictive
model for soil fertility in agriculture.
Mohammad Taha Khan et al. (2012) primarily researched the application of two decision tree
algorithms, C4.5 and C5.0, to breast cancer and heart disease prediction. The data used is
public-use data available on the web, consisting of 909 records for heart disease and 699 for breast
cancer, and the performance of both algorithms was compared. A secondary contribution of the
research shows how the generated rules can be used in evidence-based medicine (EBM), a new and
important approach that can greatly improve decision-making in health care; EBM's task is to
prevent, diagnose, and medicate diseases using medical evidence. A total of eight rules were
generated using C4.5 and C5.0 from the cancer dataset after pruning at a confidence level of 50,
while running C5.0 on the heart disease dataset generated seven rules. Evaluating the performance
of the two algorithms, the authors concluded that C5.0 handles missing values easily, whereas C4.5
shows some errors due to missing values. Running on a breast cancer dataset of 400 records, C4.5
shows 5 training errors whereas C5.0 shows only 3. C5.0 produces rules in a very easily readable
form, while C4.5 generates the rule set in the form of a decision tree.
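The readability of tree-derived rules can be illustrated with scikit-learn's `export_text`, used here as a stand-in (C4.5 and C5.0 themselves are separate tools not used in this sketch), which prints each root-to-leaf path of a fitted tree as a nested if-then rule:

```python
# Printing a fitted decision tree as human-readable if-then rules.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Each indented path reads as a rule over feature thresholds ending in a
# class decision, the kind of output clinicians can inspect directly.
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Such rule listings are what make tree-based models attractive for evidence-based medicine, where decisions must be explainable.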
Yoshikazu Goto et al. 2010, in their study, developed a simple and generally applicable bedside
model for predicting outcomes after out-of-hospital cardiac arrest (OHCA). These researchers analyzed data for
390,226 adult patients who had undergone OHCA, from a prospectively recorded nationwide
Utstein-style Japanese database for 2005 through 2009. The primary end point was survival with
favorable neurologic outcome (cerebral performance category (CPC) scale, categories 1 to 2 [CPC 1
to 2]) at 1 month. The secondary end point was survival at 1 month. They developed a decision-tree
prediction model using data from a 4-year period (2005 through 2008, n = 307,896), with
validation using external data from 2009 (n = 82,330). A simple decision-tree prediction model
permitted stratification into four prediction groups: good, moderately good, poor, and absolutely
poor. Their model identified patient groups with a range from 1.2% to 30.2% for survival and from
0.3% to 23.2% for CPC 1 to 2 probabilities. Similar results were observed when the model was
applied to the validation cohort. The authors concluded that, on the basis of a decision-tree
prediction model using four prehospital variables (shockable initial rhythm, age, witnessed arrest,
and witnessed by EMS personnel), OHCA patients can be readily stratified into four groups
(good, moderately good, poor, and absolutely poor) that help predict both survival at 1 month and
survival with favorable neurologic outcome at 1 month. This simple prediction model may provide
clinicians with a practical bedside tool for stratification of OHCA patients in the emergency
department.
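The stratification described above lends itself to a compact illustration of how a shallow decision tree can serve as a bedside tool. The sketch below is only a toy reconstruction: the four input variables and the four output groups come from the study, but the branch order and the age cut-off are our assumptions, not the published tree.

```java
// Toy sketch of a four-variable stratification rule in the spirit of the
// Goto et al. model. The branch order and the age cut-off are illustrative
// assumptions, NOT the published decision tree.
public class OhcaStratifier {
    public static String stratify(boolean shockableRhythm, int age,
                                  boolean witnessedArrest, boolean witnessedByEms) {
        if (shockableRhythm) {
            // Shockable rhythms carry the best prognosis in the study's groups.
            return witnessedArrest ? "good" : "moderately good";
        }
        if (witnessedByEms) {
            return "moderately good";
        }
        // Non-shockable, not EMS-witnessed: younger witnessed arrests fare better.
        return (age < 70 && witnessedArrest) ? "poor" : "absolutely poor";
    }

    public static void main(String[] args) {
        System.out.println(stratify(true, 60, true, false));   // good
        System.out.println(stratify(false, 80, false, false)); // absolutely poor
    }
}
```

Because every path uses at most three comparisons of routinely recorded variables, a clinician can evaluate such a tree mentally, which is what makes the decision-tree form attractive in the emergency department.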
Smitha T. and Dr. V. Sundaram 2012 studied the application of the ID3 algorithm to build a decision
tree model that predicts the chances of occurrence of a disease in an area by identifying the significant
parameters for the prediction process. Their supervised classification model was built on a data set. The
model allowed predicting the incidence of disease among inhabitants well in advance, so that preventive
measures could be taken, based on factors such as seasonal climate, rainfall data, spread
of deadly diseases, water surface temperature, temperature and precipitation measurements, and non-climatic
risk factors such as population immunity and control activities, vector abundance, family
history, etc. The prediction interval is also a factor in the analysis. A prediction accuracy of 95%
was achieved with the decision tree classification model, which led the
researchers to conclude that the disease mostly affects female inhabitants with a hereditary history,
living in poor environmental conditions, with an average age greater than 35.
Heiko Milde et al. 1999, in their research, introduced the MAD system, which generates decision
trees based on a new method for qualitative electrical circuit analysis. Different resources such as
design data and expert design know-how, as well as diagnosis knowledge, can easily be integrated
into decision tree generation. Since a decision tree can be generated automatically based on a device
model, the cost of providing, modifying, and maintaining diagnosis equipment can be drastically
reduced, and quality management of diagnosis equipment can be facilitated. It was also found
that the cost of decision-tree-based fault identification can be reduced because model-generated
decision trees can be optimized. The MAD system was successfully evaluated by integrating these
decision trees into existing STILL diagnosis systems. Since the MAD system grounds decision tree
generation on a model, it provides a systematic way to generate diagnosis systems, with the
following benefits according to their investigation: firstly, the cost of diagnosis system
generation, modification, and maintenance is reduced; secondly, quality management is facilitated;
thirdly, the average cost of decision-tree-based fault identification is reduced. Thus, the MAD system is a
generic approach to bridging the gap between (some) basic AI research concepts and industrial
applications. In particular, their new approach to qualitative reasoning about faults in
electrical circuits has reached a level of achievement such that it can be utilized to generate diagnosis
systems employed in industry.
Atul Kumar Pandey et al. 2013 compared a pruned J48 decision tree with the reduced error
pruning approach against simple pruned and unpruned approaches
for classifying heart disease based on clinical data of patients, and also developed a heart
disease prediction model that can assist medical professionals in predicting heart disease status
from these clinical features. The researchers selected 14 important clinical features, i.e., age,
sex, chest pain type, trestbps, cholesterol, fasting blood sugar, resting ECG, max heart rate, exercise-induced
angina, old peak, slope, number of vessels colored, thal, and diagnosis of heart disease.
The results of their investigation show that the accuracy of the pruned J48 decision tree with reduced
error pruning is better than that of the simple pruned and unpruned approaches. It was also
discovered that fasting blood sugar is the most important attribute, giving better classification
than the other attributes, although it does not give better accuracy.
A. R. Senthil Kumar et al. 2013 investigated the performance of soft computing techniques in
modelling qualitative and quantitative water resource variables such as streamflow. Models such as
Multiple Linear Regression (MLR), Artificial Neural Network (ANN), fuzzy logic, and
decision tree algorithms such as M5 and REPTree were used for predicting the streamflow at Kasol, located
upstream of the Bhakra reservoir in the Sutlej basin in northern India. The input vector to the various
models was derived by considering statistical properties of the time series such as the autocorrelation
function, the partial autocorrelation function, and the cross-correlation function. It was
found that the REPTree model performed well compared to the other soft computing techniques
investigated in the study (MLR, ANN, fuzzy logic, and M5P), and the results of the REPTree model
indicate that the entire range of streamflow values was simulated fairly well. The performance of
the naïve persistence model was compared with the other models, and the need for
developing the naïve persistence model was also analysed using the persistence index.
B.S. Zhang et al. 2004 applied decision tree models to predict annual and seasonal pasture
production and investigated the interactions between pasture production and environmental and
management factors in the North Island hill country. The results showed that spring rainfall was the
most important factor influencing annual pasture production, while hill slope was the most
important factor influencing spring and winter production. Summer and autumn rainfall were the
most important factors influencing summer and autumn production, respectively. The decision tree
models for annual, spring, summer, autumn, and winter pasture production correctly predicted 82%,
71%, 90%, 88%, and 90% of cases in the model validation. According to their investigation, by
integrating with a geographic information system (GIS), the outputs of these decision tree models
can be used as a tool for pasture management in assessing the impacts of alternative phosphorus
fertilizer application strategies, or of potential climate change such as summer drought, on hill pasture
production. This can assist farmers in making decisions such as setting stocking rates and assessing
feed supply.
Sevgi Zeynep Dogan et al. 2008, in their study, compared the performance of three different
decision-tree-based methods of assigning
attribute weights to be used in a case-based reasoning (CBR) prediction model. The attribute
weights are generated by considering the presence, absence, and positions of the
attributes in the decision tree. This process and the development of the CBR simulation model are
described in the paper. The model was tested using data pertaining to the early design parameters
and unit cost of the structural system of residential building projects. The CBR results from their
investigation indicated that the attribute weights generated by taking into account the information
gain of all the attributes performed better than the attribute weights generated by considering only
the appearance of attributes in the tree. The study is of benefit primarily to researchers, as it
compares the impact of attribute weights generated by three different methods and, hence,
highlights the fact that the prediction rate of models such as CBR largely depends on the data
associated with the parameters used in the model.
Bark Cheung Chiu et al. 2013 adopted Input-Output Agent Modelling (IOAM),
an approach to modelling an agent in terms of relationships between the inputs and outputs
of the cognitive system, together with a leading inductive learning algorithm, C4.5, to build a
subtraction skill modeller, C4.5-IOAM. It models agents' competencies with a set of decision trees.
It was discovered that C4.5-IOAM makes no prediction when predictions from different decision
trees are contradictory, so the authors proposed three techniques for resolving such situations.
Two techniques involve selecting the more reliable prediction from a set of competing predictions
using a tree quality measure and a leaf quality measure. The third technique merges multiple
decision trees into a single tree, which has the additional advantage of producing more
comprehensible models. Experimental results from their investigation, in the domain of
modelling elementary subtraction skills, showed that the tree quality and the leaf quality of a
decision path provided valuable references for resolving contradicting predictions, and that a single-tree
model representation performed nearly as well as the multi-tree model representation.
Lee S. and Park I. 2013, in their study, analyzed ground subsidence hazard using factors that can
affect ground subsidence and a decision tree approach in a geographic information system (GIS).
The study area was Taebaek, Gangwon-do, Korea, where many abandoned underground coal mines
exist. Spatial data, topography, geology, and various ground-engineering data for the subsidence
area were collected and compiled in a database for mapping ground-subsidence hazard (GSH). The
subsidence area was randomly split 50/50 for training and validation of the models. A data-mining
classification technique was applied to the GSH mapping, and decision trees were constructed using
the chi-squared automatic interaction detector (CHAID) and the quick, unbiased, and efficient
statistical tree (QUEST) algorithms. The frequency ratio model was also applied to the GSH
mapping for comparison with the probabilistic model. The resulting GSH maps were validated using
area-under-the-curve (AUC) analysis with the subsidence area data that had not been used for
training the model. The highest accuracy was achieved by the decision tree model using the CHAID
algorithm (94.01%), compared with the QUEST algorithm (90.37%) and the frequency ratio model
(86.70%). These accuracies are higher than previously reported results for decision trees. Decision
tree methods can therefore be used efficiently for GSH analysis and might be widely used for
prediction of various spatial events.
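The area-under-the-curve validation used above can be read as a pairwise ranking probability: the chance that a randomly chosen subsidence cell receives a higher hazard score than a randomly chosen non-subsidence cell. A minimal sketch of that pairwise formulation, with made-up scores:

```java
// Minimal sketch of AUC computed as the probability that a randomly chosen
// positive case outranks a randomly chosen negative case (ties count half).
// The hazard scores below are made up for illustration.
public class AucDemo {
    public static double auc(double[] positives, double[] negatives) {
        double wins = 0.0;
        for (double p : positives) {
            for (double n : negatives) {
                if (p > n) wins += 1.0;       // positive ranked above negative
                else if (p == n) wins += 0.5; // tie counts as half a win
            }
        }
        return wins / (positives.length * negatives.length);
    }

    public static void main(String[] args) {
        double[] subsidence = {0.9, 0.8, 0.6};   // scores at true subsidence cells
        double[] noSubsidence = {0.7, 0.4, 0.2}; // scores at cells without subsidence
        System.out.printf("AUC = %.3f%n", auc(subsidence, noSubsidence));
    }
}
```

An AUC of 1.0 means the hazard map ranks every true subsidence cell above every other cell; 0.5 is no better than chance, which is why the 94.01% figure above indicates a strong model.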
Middendorf et al. used alternating decision trees to predict whether an S. cerevisiae gene would be
up- or down-regulated under particular conditions of transcription regulator expression, given the
sequence of its regulatory region. In addition to good performance in predicting the expression state of
target genes, they were able to identify motifs and regulators that appear to control the expression
of the target genes.
4.0 METHODOLOGY
This paper uses the ID3 (Iterative Dichotomiser 3) decision tree data mining algorithm to develop a model,
after a critical review of the use of the algorithm. This classification algorithm was selected
because it very often has the potential to yield good results in prediction and classification applications.
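ID3 grows a tree by repeatedly splitting on the attribute with the highest information gain. A minimal sketch of the two quantities involved, entropy and gain, for a binary class label (the counts in `main` are illustrative only):

```java
import java.util.Arrays;

// Minimal sketch of the entropy and information-gain computations at the
// core of ID3, shown for a binary class label. The counts used in main()
// are illustrative, not taken from the heart disease dataset.
public class Id3Math {
    // Entropy of a label distribution: H = -sum p * log2(p).
    public static double entropy(int[] classCounts) {
        int total = Arrays.stream(classCounts).sum();
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue; // 0 * log(0) is taken as 0
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of a split: parent entropy minus the
    // count-weighted entropy of the child partitions.
    public static double infoGain(int[] parent, int[][] children) {
        int total = Arrays.stream(parent).sum();
        double gain = entropy(parent);
        for (int[] child : children) {
            int n = Arrays.stream(child).sum();
            gain -= ((double) n / total) * entropy(child);
        }
        return gain;
    }

    public static void main(String[] args) {
        int[] parent = {9, 5};            // e.g. 9 positive, 5 negative examples
        int[][] split = {{6, 2}, {3, 3}}; // partition induced by one attribute
        System.out.printf("H(parent) = %.3f%n", entropy(parent));
        System.out.printf("gain = %.3f%n", infoGain(parent, split));
    }
}
```

At each internal node ID3 evaluates this gain for every remaining attribute, picks the maximum, and recurses on the partitions; attributes with near-zero gain are effectively the "insignificant" ones the algorithm leaves out of the tree.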
4.1 MODULES EXPLANATION (DOCUMENTATION)
A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute,
each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken
after computing all attributes). A path from root to leaf represents a classification rule.
The Java program consists of several packages, but ID3 Logic is the package that does the main work.
The ID3 Logic package comprises several classes, including:
DataLoader interface
DecisionNode class
DefaultDataloader class
Example class
Feature class
ID3 class
ResultNode class
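The paper does not show the internals of these classes, so the following is only a guessed minimal shape for the DecisionNode and ResultNode classes from the listing: an internal node routes an example down a branch selected by an attribute value, and a leaf returns the class label.

```java
import java.util.HashMap;
import java.util.Map;

// Guessed minimal shapes for the DecisionNode and ResultNode classes named
// in the package listing; the actual fields are not given in the paper.
interface TreeNode {
    String classify(Map<String, String> example);
}

// Internal node: tests one attribute and routes the example to a child branch.
class DecisionNode implements TreeNode {
    private final String attribute;
    private final Map<String, TreeNode> branches = new HashMap<>();

    DecisionNode(String attribute) { this.attribute = attribute; }

    void addBranch(String value, TreeNode child) { branches.put(value, child); }

    public String classify(Map<String, String> example) {
        TreeNode child = branches.get(example.get(attribute));
        return child == null ? "unknown" : child.classify(example);
    }
}

// Leaf node: carries the class label, i.e. the decision taken.
class ResultNode implements TreeNode {
    private final String label;
    ResultNode(String label) { this.label = label; }
    public String classify(Map<String, String> example) { return label; }
}

public class TreeDemo {
    public static void main(String[] args) {
        DecisionNode root = new DecisionNode("chestPain"); // attribute name is illustrative
        root.addBranch("typical", new ResultNode("Present"));
        root.addBranch("none", new ResultNode("Absent"));
        System.out.println(root.classify(Map.of("chestPain", "typical"))); // Present
    }
}
```

With this shape, a path from the root DecisionNode to a ResultNode is exactly one classification rule, matching the flowchart description above.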
HEART DISEASE DATA
A record set with medical attributes was obtained online from a hospital. With the help of the dataset, the
patterns significant to heart attack prediction are extracted using the developed ID3 data mining model.
The records were split equally into two datasets: a training dataset and a testing dataset. To avoid bias, the
records for each set were selected randomly. The data include values for the following:
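The random, bias-avoiding 50/50 split described above can be sketched as a shuffle followed by a cut. The fixed seed and the integer records below are illustrative stand-ins for the real patient records:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of an unbiased 50/50 train/test split: shuffle the records with a
// seeded RNG (so the split is repeatable), then cut the list in half.
// Integer records stand in for the real patient records.
public class DatasetSplitter {
    public static <T> List<List<T>> split(List<T> records, long seed) {
        List<T> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, new Random(seed)); // random order removes ordering bias
        int mid = shuffled.size() / 2;
        List<T> train = new ArrayList<>(shuffled.subList(0, mid));
        List<T> test = new ArrayList<>(shuffled.subList(mid, shuffled.size()));
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 100; i++) records.add(i); // 100 placeholder records
        List<List<Integer>> parts = split(records, 42L);
        System.out.println(parts.get(0).size() + " train, " + parts.get(1).size() + " test");
    }
}
```

Shuffling before cutting is what makes the selection random: without it, any ordering in the source file (e.g. by admission date) would leak into the two halves.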
RESULTS
The system has been packaged as a jar file which, once double-clicked on a system with a Java runtime
installed, launches the application. The result page shows the result of the prediction, which can be either
Heart Disease Present or Absent.
5.0 CONCLUSION
Data mining provides algorithms and tools for identifying valid, novel, potentially useful, and ultimately
understandable patterns in data. As demonstrated in this project, data mining is not limited to business;
it has evolved, and continues to evolve, from the intersection of research fields such as machine
learning, pattern recognition, databases, statistics, AI, and knowledge acquisition for expert systems. It
is also heavily used in the medical field for patient diagnosis records and rational decision making.
The decision tree data mining model developed and employed in this paper was used to predict the
existence of heart disease in any diagnosed patient.
The process starts by preparing the collected data to conform to the format needed by the system. It
proceeds by training the system, having set up the tree rules and parameters. The system is then tested
with a test data set to verify the output.
Decision trees have been found useful in classification and prediction modelling because of their
capability to accurately discover hidden relationships between variables, their ability to remove
insignificant attributes within a dataset, and their presentation of knowledge in a hierarchical
structure that makes it understandable even to non-experts in the field of data mining.
Twenty-three studies published between 1999 and 2014, across more than three application domains,
were studied in this paper and met the minimum criteria for inclusion in our literature review.
The studies reviewed provide an overview of the applications of decision tree modelling in domains
such as business management, engineering, health-care management, and agriculture, and all of them
concluded that decision trees play a vital role in prediction and classification modelling.
REFERENCES
1. Abishek Suresh, Velmurugan Karthikraja, Sajitha Lulu, Uma Kangueane, Pandjassarame
Kangueane 2009. A decision tree model for the prediction of homodimer folding
mechanism.Biomedical Informatics, Pondicherry 607402, AIMST University, Semeling
08100, Malaysia.
2. A. R. Senthil kumar, Manish Kumar Goyal, C. S. P. Ojha, R. D. Singh and P. K. Swamee,
2013. Application of Artificial Neural Network, fuzzy logic and decision tree algorithms
for modelling of streamflow at Kasol in India. IOSR Journal of Computer Engineering
(IOSR-JCE) e-ISSN: 2278-0661, p- ISSN: 2278-8727Volume 12, Issue 6
3. Atul Kumar Pandey, Prabhat Pandey, K.L. Jaiswal, Ashish Kumar Sen; Anyanwu, M. N., & Shiva,
S. G. (2009). A Heart Disease Prediction Model using Decision Tree.
4. Abolfazl Kazemi, Mohammad Esmaeil Babaei, 2011. Modelling Customer Attraction
Prediction in Customer Relation Management using Decision Tree: A Data Mining
Approach. Department of Management, Qazvin Branch, Islamic Azad University, Qazvin, Iran.
5. Bark Cheung Chiu, Geoffrey I. Webb, Zijian Zheng 2013. Using decision trees for agent
modelling: A study on resolving conflicting predictions
6. B.S. Zhang, I. Valentine and P.D. Kemp, 2004. Modelling hill country pasture production: a
decision tree approach. Institute of Natural Resources, Massey University, PB 11-222, Palmerston
North.
7. Baisen Zhang; Tillman, Russ, December 2007. A decision tree approach to modelling
nitrogen fertiliser use efficiency in New Zealand pasture. Plant & Soil, Vol. 301, Issue 1/2, p. 267.
8. C5 Algorithm. (2012). Retrieved from http://rulequest.com/see5-comparison.html
9. David, H., Heikki, M., and Padhraic, S. (2001): Principles of Data Mining, MIT Press, Cambridge, MA
10. Dr. David L. Olson and Dr. Dursun Delen (2008): Advanced Data Mining Techniques
11. Felipe Liraa, Pedro S. Perezb, José A. Baranauskasb and Sérgio R. Nozawaa, 2014.Prediction
of Antimicrobial Activity of Synthetic Peptides by a Decision Tree Model. Laboratório de
Expressão Gênica, Universidade Nilton Lins, Manaus, Amazonas, Brazila. Departamento de
Computação e Matemática, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto,
Universidade de São Paulo, Ribeirão Preto, Brazilb
12. Gregor Stiglic, Simon Kocbek, [...], and Peter Kokol 2012.Comprehensive Decision Tree
Models in Bioinformatics
13. Heiko Milde, Lothar Hotz, Jörg Kahl, 1999 Decision Tree Generation for Diagnosis in Real
World Industrial Application. Stephanie Wessel Laboratory for Artificial Intelligence, University
of Hamburg Vogt-Koelln-Str. 30, 22527 Hamburg, Germany.
14. Heiko Milde, Lothar Hotz, Jörg Kahl, Bernd Neumann, and Stephanie Wessel,1999.MAD: A
Real World Application of Qualitative Model-Based Decision Tree Generation for
DiagnosisLaboratory for Artificial Intelligence, University of Hamburg Vogt-Koelln-Str. 30, 22527
Hamburg, Germany
15. Harris, E. (2002). "Information Gain Versus Gain Ratio: A Study of Split Method Biases".
AMAI.
16. Imaging and Computational Intelligence (ICI) Group, School of Electrical & Electronic
Engineering, Universiti Sains Malaysia, Engineering Campus, Nibong Tebal, 14300 Penang,
Malaysia.Department of Computer Science and Software Engineering, Faculty of Science and
Information Technology, Jadara University, P.O. Box 733, Irbid 21110, Jordan
17. Jiawei Han,2006 Data Mining: Concepts and Techniques. Department of Computer Science,
University of Illinois at Urbana-Champaign.
18. Kuldeep Kumar, Bond University,Ping Zhang, University of Queensland,Brijesh Verma, Central
Queensland University. Application of Decision Trees for Mass Classification in Mammography
Volume 111, 2012, pp 567-573
19. Lee S, Park I. 2013 Application of decision tree model for the ground subsidence hazard
mapping near abandoned underground coal mines. Geological Mapping Department, Korea
Institute of Geoscience and Mineral Resources (KIGAM), 124 Gwahang-no Yuseong-gu, Daejeon
305-350, Republic of Korea. [email protected]
20. Mutasem Sh. Alkhasawneh, Umi Kalthum Ngah, Lea Tien Tay, Nor Ashidi Mat Isa, and
Mohammad Subhi Al-Batah, 2002 Decision Tree Applications for Data Modelling.
21. Manpreet Singh, Sonam Sharma, Avinash Kaur 2013. Performance Analysis of Decision
Trees. Department of Information Technology, Guru Nanak Dev Engineering College, Ludhiana,
Punjab, India CBS Group of Institutions, New Delhi,India Department of Computer Science,
Lovely Professional University, Phagwara, Punjab, India. International Journal of Computer
Applications (0975 – 8887) Volume 71– No.19, June 2013 10
22. Mohd Najwadi Yusoff and Aman Jantan, 2011. Optimizing Decision Tree in Malware
Classification System by using Genetic Algorithm.
23. Mahjoobi, J. ; Iran Univ. of Sci. & Technol., Tehran ; Shahidi, A.E.,2007.Application of
decision tree techniques for the Prediction of Significant Wave Height
24. Michael D. Twa, Srinivasan Parthasarathy, Cynthia Roberts, Ashraf M. Mahmoud,
Thomas W. Raasch, and Mark A. Bullimore, 2005. Automated Decision Tree Classification of
Corneal Shape.
25. Olugbenga Oluwagbemi, Uzoamaka Ofoezie, Nwinyi Obinna (2012): A Knowledge-Based Data Mining
System for Diagnosing Malaria Related Cases in Healthcare Management. Egyptian Computer Science
Journal Vol. 34, No.4 May 2010.
26. Peng Du, Ding Xiaoqing,2008 The Application of Decision Tree in Gender Classification
Congress on Image and Signal Processing, Vol. 4 May 27-May 30 ISBN: 978-0-7695-3119-9
27. Ruggieri, S. (2002, April). "Efficient C4.5 [classification algorithm]". IEEE Transactions on
Knowledge and Data Engineering, pp. 438-444.
28. Rokach, Lior; Maimon, O. (2008). Data mining with decision trees: theory and applications.
World Scientific Pub Co Inc. ISBN 978-9812771711.
29. Sevgi Zeynep Dogan, David Arditi, H. Murat Günaydin, 2008. Using Decision Trees for
Determining Attribute Weights in a Case-Based Model of Early Cost Prediction. Journal of
Construction Engineering and Management, Vol. 134, No. 2, February 2008, pp. 146-152.
(doi: http://dx.doi.org/10.1061/(ASCE)0733-9364(2008)134:2(146))
30. Smitha, T., Dr. V. Sundaram, 2012. Case Study on High Dimensional Data Analysis Using
Decision Tree Model. Karpagam University, Coimbatore; (Asst. Professor, MCA Dept, SNGIST,
N. Paravoor, Kerala) Director-MCA, Karpagam College of Engineering, Karpagam University,
Coimbatore.
31. Quinlan, J. R., (1986). Induction of Decision Trees. Machine Learning 1: 81-106, Kluwer
Academic Publishers.
32. Venkatadri, & Lokanatha. (2011).“A Comparative Study of Decision Tree Classification
Algorithms in Data Mining.”. International Journal of Computer Applications in Engineering,
Technology and Sciences, Vol 3,No.3.,pp. 230-240
33. Wang Wei, Wang Yunjia, Wang Qing, Lian Dajun, Wang Zhijie,2012 Application of Decision
Tree in Land Use Classification
34. Yoshikazu Goto, Tetsuo Maeda and Yumiko Goto, 2010. Decision-tree model for
predicting outcomes after out-of-hospital cardiac arrest in the emergency department.
Editors
Prof. Amos DAVID & Prof. Charles UWADIA
978-2-9546760-1-2