International Journal of Trend in Research and Development, Volume 3(6), ISSN: 2394-9333
www.ijtrd.com
Application of Data Mining and Soft Computing
Techniques for Intelligent Medical Data Analysis
¹Vinutha M.R. and ²Dr. Chandrika J.
¹Assistant Professor, ²Professor,
¹Information Science and Engineering, Malnad College of Engineering, Hassan, Karnataka, India
²Computer Science and Engineering, Malnad College of Engineering, Hassan, Karnataka, India
Abstract: Data mining plays a crucial role in the field of
medicine. Although medical data is vast and rich in knowledge,
much of the useful data goes to waste when we fail to extract
the valuable information from it. Extracting effective
information from rich medical data and making sound decisions
for predicting diseases is becoming increasingly necessary.
Data mining techniques can be used to mine massive medical
data and to learn the hidden patterns and relationships that
help in making effective decisions. This paper throws light on
the different methods of knowledge abstraction, based on data
mining and soft computing techniques, that are applied in
current research on disease prediction.
Keywords: Data Mining, Classification, Clustering, Naive Bayes,
Decision Trees, Genetic Algorithm, Fuzzy Logic, Soft Computing.
I. INTRODUCTION
Medical diagnosis is extremely important; it has to be carried
out with great care and should be performed conclusively. Data
mining is the process of finding patterns in very large data
sets. In the field of medicine, data mining is used for
prognosis, diagnosis, decision making and treatment planning.
So far researchers have applied data mining and soft computing
techniques independently to understand a disease and the
various factors influencing it, and also to predict survival
from diseases such as cancer, thyroid disease, diabetes, liver
disorders and hepatitis.
Extracting useful information from rich medical data and
making sound decisions for predicting diseases is becoming
progressively more necessary. An undesirable clinical decision
can put the life of an individual at risk, so there should be
no compromise in decisions taken about the health of the
patient. Healthcare data should be mined to discover the
patterns that play a significant role in making crucial
medical decisions. Beyond any doubt, there is a need for a
system that discovers the hidden patterns, relationships and
trends in such data.
The rest of the paper is organized as follows. Section two
focuses on works in the literature related to the diagnosis of
diseases using data mining and, to some extent, soft
computing. Sections three and four give a brief introduction
to soft computing and data mining. Section five describes the
proposed methodology, and the conclusion is presented in
section six.
II. RELATED WORK
M.A. Jabbar et al.[1] developed a computational approach for
the early diagnosis of heart disease. They proposed a method
to enhance naive Bayes. The researchers used discretization
and feature subset selection measures such as chi-square, gain
ratio, OneR and genetic search, and concluded that OneR with
genetic search is the most suitable combination with naive
Bayes for the early diagnosis of heart disease.
IJTRD | Nov-Dec 2016
Available [email protected]
Mamuna Fatima et al.[2] applied k-means[18][25] clustering to
a preprocessed data set in order to extract useful features
and patterns that help in the effective prediction of heart
disease. The attributes extracted by applying clustering
algorithms to 1500 records include age, gender, known disease,
heart rate, blood pressure, peak exercise and risk factor. A
comparison of k-means with other clustering algorithms was
carried out, and the researchers claimed that the performance
of k-means is similar to that of k-means fast and x-means but
better than that of k-medoids and DBSCAN.
Sellappan Palaniappan et al.[3] developed a prototype called
the heart disease prediction system using three data mining
classification techniques, namely naive Bayes, neural networks
and decision trees. The system extracts hidden knowledge from
a heart disease database by considering fifteen features.
Based on the outcome of the various tests, the researchers
concluded that the most effective model for predicting heart
disease is naive Bayes, followed by neural networks and
decision trees.
Sivagowry S. et al.[5] conducted an empirical study on
applying data mining techniques to the analysis and prediction
of heart disease. The existing literature shows that
classification plays a more important role in heart disease
prediction than clustering, association rules and regression.
They chose 14 attributes such as age, sex, chest pain type,
resting blood sugar, cholesterol and resting
electrocardiographic results, used the Tanagra tool to
classify the data, and evaluated the result using 10-fold
cross validation.
Hezlin Aryani Abd Rahman et al.[11] developed a neural network
model, decision trees and logistic regression for the purpose
of predicting the survival status of cardiac surgery patients.
The data consists of 5154 observations with 23 variables.
After the data cleaning process, a total of 4976 cases and 12
variables were used for analysis. The three predictive models
were developed and compared using the classification accuracy
rate, sensitivity and specificity. After comparing the three
models, the researchers claimed that the neural network model
is the best one for predicting the survival status of cardiac
surgery patients. The neural network analysis showed that the
important predictive variables include chest reopen, age,
surgery type, gender, reoperation status and wound infection.
N. Poolsawad et al.[13] investigated the characteristics of a
clinical data set using feature selection and classification
techniques to deal with missing values and to develop a method
to quantify numerous complexities. The researchers also gave a
comprehensive evaluation of a set of diverse machine learning
schemes for clinical decisions. The data set used in the study
is a large cardiological database called LIFELAB, which is a
prospective study consisting of 463 variables.
T. John Peter et al.[6] proposed the use of pattern
recognition and data mining techniques in prediction models
for the cardiovascular clinical domain. They used the
Attribute-Relation File Format (ARFF), an ASCII text file that
describes a list of instances sharing a set of attributes. The
researchers proposed a method to investigate the performance
of different classification algorithms on 270 medical records.
After evaluating all the classification algorithms, they
concluded that naive Bayes gives better accuracy for heart
disease prediction than the other classifiers.
Purushottam et al.[17] used the Knowledge Extraction based on
Evolutionary Learning (KEEL) tool, an open source Java
software tool for assessing evolutionary algorithms for data
mining problems. The researchers used the Cleveland database
with 303 instances. The database contains a total of 76 raw
attributes, but only 14 of them are actually used in the
experiment. The data set is divided into two parts: of the 303
instances, 151 are placed in part 1 and 152 in part 2. The
overall proportion of successes achieved is 0.8675, where the
proportion of successes in part 1 is 0.863 and in part 2 is
0.872.
V. Krishnaiah et al.[19] developed a fuzzy approach for
diagnosing heart disease in patients. In order to remove the
uncertainty of unstructured data, the researchers used a fuzzy
k-NN approach, introducing an exponential membership function
with the standard deviation and mean calculated for the
measured attribute. The data set consists of 1200 records with
13 attributes and has been divided into 25 classes of 48
records each. To assess the correctness of the system, the
data set was divided into training and testing sets of equal
size. Their work shows that the interval approach of treating
data as symbolic data was successful in improving the accuracy
of the system.
Ankit Dewan et al.[21] applied various machine learning
techniques, such as an artificial neural network trained with
back propagation and a genetic algorithm for optimization.
Because back propagation has the drawback of becoming stuck in
local minima, the researchers were not able to achieve the
maximum benefit from it alone. They therefore employed the
genetic algorithm, which uses mutation and crossover over
successive generations. They concluded that the neural network
is the best among all the classification techniques,
especially for the prediction or classification of non-linear
data.
K. Srinivas et al.[26] focused on using different algorithms
for predicting combinations of several target attributes. They
presented an automated and effective heart attack prediction
system using data mining techniques, and provided an efficient
approach for extracting significant patterns from heart
disease data warehouses. Based on the calculated significant
weightage, the frequent patterns with a value greater than a
predefined threshold were chosen for the prediction of heart
attack. Fifteen attributes are listed in the medical
literature as significant for predicting heart attack. The
authors proposed the inclusion of the additional attributes
stress, pollution, previous medical history and financial
status. They also suggested that data mining techniques such
as time series analysis, clustering and association can be
used to analyze patients' behavior. The architecture of the
neural network used in this study is the multilayered
feed-forward network architecture. Stochastic back propagation
algorithm is used for the construction of the fuzzy-based
neural network.
III. SOFT COMPUTING
Soft computing is an innovative approach to building
computationally intelligent systems. It is typically applied
to problems for which no known algorithm can compute an exact
solution in polynomial time. Soft computing tolerates
imprecision and partial truth and often works well with
approximation. It includes several techniques such as fuzzy
logic[15], neural networks[4], artificial intelligence[29],
genetic algorithms[28] and machine learning.
A. Fuzzy Logic
Fuzzy logic provides a powerful method to categorize a concept
in an abstract way by introducing vagueness. Fuzzy logic is a
form of many-valued logic in which the truth values of
variables may be any real number between 0 and 1. It is used
to handle the concept of partial truth, where the truth value
may range between completely true and completely false. The
output of a fuzzy system is a consensus of all possible inputs
and all possible rules, so a fuzzy logic system can remain
well behaved when input values are not completely available or
trustworthy.
The steps in the fuzzy logic process are:
1. Fuzzify all input values into fuzzy membership functions.
2. Execute all applicable rules in the rule base to compute
the fuzzy output functions.
3. De-fuzzify the fuzzy output functions to get "crisp"
output values.
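The three steps above can be sketched as follows. This is a minimal illustration, not a method taken from the surveyed papers: the input variable (heart rate), the triangular membership functions, the three rules and their output levels are all hypothetical, and for brevity each rule produces a single representative output level, which is defuzzified by a weighted average instead of computing full fuzzy output functions.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 outside (a, c), rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzy_risk(heart_rate):
    """Map a crisp heart rate to a crisp risk score in [0, 1] (hypothetical rules)."""
    # Step 1: fuzzify the crisp input into membership degrees.
    low = triangular(heart_rate, 40, 60, 80)
    normal = triangular(heart_rate, 60, 80, 100)
    high = triangular(heart_rate, 80, 100, 120)

    # Step 2: apply the rule base. Each rule pairs its firing strength with
    # an assumed output level (simplification: one level per rule).
    rules = [(low, 0.5), (normal, 0.1), (high, 0.9)]

    # Step 3: defuzzify with a weighted average of the rule outputs.
    total = sum(strength for strength, _ in rules)
    if total == 0:
        return None  # no rule fires for this input
    return sum(strength * level for strength, level in rules) / total
```

With these assumed sets, an input of 90 belongs half to "normal" and half to "high", so the crisp output falls midway between the two rule levels.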
Limitations
1. Fuzzy logic systems use approximation, so they are not a
good choice for managing systems that require extreme
precision.
2. Fuzzy logic systems are very expensive to develop, as they
often require extensive testing.
B. Genetic Algorithm
The genetic algorithm is a powerful tool for optimization. A
typical genetic algorithm[28] requires a genetic
representation of the solution domain and a fitness function
to evaluate candidate solutions. There are three fundamental
operators in a genetic algorithm: selection, which equates to
survival of the fittest; crossover, which represents mating
between individuals; and mutation, which introduces random
modifications. Using the selection operator alone tends to
fill the population with copies of the best individual in the
population. Using selection and crossover tends to cause the
algorithm to converge on a good but sub-optimal solution.
Using mutation alone induces a random walk through the search
space. Using selection and mutation creates a parallel,
noise-tolerant, hill-climbing algorithm.
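As a rough sketch of how the three operators fit together, the following toy example maximizes the number of 1s in a bit string. The problem, the tournament selection scheme, the single-point crossover, the elitism step and all parameter values are assumptions chosen for illustration; none of them comes from the surveyed papers.

```python
import random


def fitness(bits):
    """Toy fitness function: the number of 1s in the bit string."""
    return sum(bits)


def select(population, rng, k=3):
    """Selection: pick the fittest of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """Crossover: join a prefix of parent a with a suffix of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rng, rate):
    """Mutation: flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(n_bits=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the current best unchanged (elitism, an added assumption).
        children = [max(population, key=fitness)]
        while len(children) < pop_size:
            child = crossover(select(population, rng),
                              select(population, rng), rng)
            children.append(mutate(child, rng, rate))
        population = children
    return max(population, key=fitness)
```

Calling `evolve()` returns the fittest bit string found; setting `rate=0` leaves only selection and crossover, which makes the premature convergence described above easy to observe.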
IV. DATA MINING
Data mining[5] is the process of extracting hidden and useful
patterns from very large data sets. Data mining primarily
comprises three techniques: classification[10], in which
occurrences are grouped into selected classes, clustering and
regression. Classification is a mode of analyzing data that
extracts models representing important classes. Such extracted
models are called classifiers, and they predict class labels.
The classification technique[16] is used for studying and
examining the existing data and also to predict the future
behavior, which helps in arriving at intelligent decisions.
Classification is a two-step process. The first step is the
learning phase, in which the training data set is analyzed and
the rules and patterns are extracted. The second step is the
classification step, which tests the data and records the
accuracy of the classification patterns. Classification
methods include decision tree induction[14][16][23],
rule-based classifiers, Bayesian classifiers[8], nearest
neighbor classifiers, artificial neural networks[9][10] and
support vector machines[12][20]. Regression is a data mining
task that predicts a number. A regression task begins with a
data set in which the target values are known. In the process
of building a model, a regression algorithm estimates the
value of the target as a function of the predictors for each
case in the training data. The relationships between the
predictors and the target are summarized in a model, which can
then be applied to a different data set in which the target
values are unknown.
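To make the regression workflow concrete, the sketch below fits a straight line to training cases with known targets by ordinary least squares and then applies the model to a new case. Restricting the model to one predictor, and the function names, are assumptions made for brevity.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept of y = slope * x + intercept
    from training cases whose target values ys are known."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted model to a case whose target is unknown."""
    slope, intercept = model
    return slope * x + intercept
```

For example, fitting the hypothetical training pairs (1, 3), (2, 5), (3, 7), (4, 9) recovers a slope of 2 and an intercept of 1, so the predicted target for a new case with predictor value 5 is 11.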
A. Bayesian Network
A Bayesian network is a graphical model used for identifying
the relationships among a set of features. It is a directed
acyclic graph in which the nodes have a one-to-one
correspondence with the features of a data set. Bayesian
classifiers have shown high accuracy and high speed when
applied to large databases. The naive Bayes classifier is
based on Bayes' theorem.
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)} \qquad (1)$$
X - evidence, described by measurements on a set of attributes
H - hypothesis that the data tuple X belongs to a specified class c
P(H|X) - posterior probability that the hypothesis H holds given X
P(H) - prior probability of H, independent of X
P(X|H) - probability of X conditioned on H
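Equation (1) can be turned into a small categorical naive Bayes classifier. The sketch below is illustrative only: it assumes the features are independent given the class (the naive Bayes assumption), estimates probabilities by simple counting without smoothing, and uses hypothetical function names and data.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (label, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(label, i)][value] += 1
    return priors, counts, len(labels)


def posterior(model, row):
    """Return P(H|X) for every class H given the evidence X = row."""
    priors, counts, n = model
    scores = {}
    for label, n_label in priors.items():
        score = n_label / n  # prior P(H)
        for i, value in enumerate(row):
            # P(x_i | H), multiplied under the independence assumption
            score *= counts[(label, i)][value] / n_label
        scores[label] = score  # P(X|H) * P(H)
    evidence = sum(scores.values())  # P(X)
    if evidence == 0:
        return scores  # evidence never seen with any class
    return {label: score / evidence for label, score in scores.items()}
```

Because no smoothing is applied, a feature value that never occurs with a class drives that class's posterior to zero; a practical implementation would usually add a smoothing term.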
Advantages
1. Capable of handling noisy data and able to classify
patterns of untrained data.
2. Training and classification can be done at a faster rate.
3. Less sensitive to irrelevant features.
Limitations
1. Poor interpretability.
2. Assumes independence of features.
B. K-Nearest Neighbor
K-nearest neighbor is the simplest method, in which an object
is classified based on the closest training examples in the
feature space. It computes the decision boundary implicitly,
although the boundary can also be computed explicitly. The
computational complexity of nearest neighbor is a function of
the complexity of the boundary. No explicit training step is
required. The neighbors are selected from a set of objects for
which the correct classification is known. The best choice of
k depends upon the data set: a higher value of k reduces the
effect of noise on the classification, whereas if the value of
k is small, noisy samples may become prominent, which results
in misclassification errors.
Advantages
1. Understanding and implementation is easy.
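A minimal sketch of the classifier described above is given below. It assumes numeric features, Euclidean distance and a simple majority vote; the function name and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs with known labels.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

The effect of k on noise can be seen with a training set in which one point near a cluster of class "a" carries the label "b": with k=1 a query next to that point inherits the noisy label, while with k=3 the two surrounding "a" points outvote it.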
C. Decision Tree
In the decision tree classification technique[14][16],
classification is done based on splitting criteria. The
decision tree is a flow-chart-like structure in which
instances are classified by sorting them based on their
feature values. Each internal node represents a test on an
attribute, each branch represents an outcome of the test, and
each leaf node represents a class label. The tree is
constructed using a greedy method in a top-down fashion. The
process of constructing the tree starts with the training set
and recursively finds a split feature by maximizing some local
criterion. Measures such as the Gini index, information gain
and gain ratio can be used to find the feature that best
divides the training data. Algorithms such as ID3, C4.5 and
CART are widely used.
1. ID3
ID3 stands for Iterative Dichotomiser 3. It constructs the
decision tree using a top-down greedy approach. The measure
used for selecting the best attribute is information gain. In
order to find the information gain, the entropy must first be
calculated.
$$\mathrm{Entropy}(S) = -\sum_{I \in C} P(I)\,\log_2 P(I) \qquad (2)$$
Where S refers to all the records,
P(I) refers to the proportion of S belonging to class I,
C refers to the set of classes,
and the summation is over all classes in C.
$$\mathrm{Information\ Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in V} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \qquad (3)$$
Where A is the feature for which the gain is calculated,
V is the set of all possible values of feature A,
S_v is the subset of S for which A takes the value v, and
|S_v| is its number of elements.
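Equations (2) and (3) translate directly into code. The sketch below assumes that a feature is given as a list of categorical values aligned with the list of class labels; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Equation (2): minus the sum over classes of P(I) * log2 P(I)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Equation (3): Entropy(S) minus the weighted entropy of each subset S_v."""
    n = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder
```

For a set of two "yes" and two "no" records the entropy is 1 bit. A feature that separates the two classes perfectly has an information gain of 1, while a feature whose values are spread evenly across both classes has a gain of 0, so ID3 would choose the former as the split.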
2. CART
CART is one of the most important tools in data mining. CART
stands for Classification and Regression Trees. It uses binary
splitting, so each internal node has exactly two outgoing
edges, and the split is chosen by computing the Gini index.
$$\mathrm{Gini\ index} = 1 - \sum_{I} p^2(I) \qquad (4)$$
Over the years CART has been regarded as one of the fastest
and most versatile predictive modeling algorithms available to
analysts.
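The binary split selection can be sketched as follows for a single numeric feature. This is a simplified illustration and not the full CART procedure: it only scans thresholds of the form "value <= t" and returns the threshold with the lowest weighted Gini index; the function names and data are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, weighted Gini) of the best split `value <= threshold`.

    Returns None when the feature has fewer than two distinct values.
    """
    n = len(labels)
    best = None
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = (t, score)
    return best
```

With hypothetical ages 50, 55, 60, 65 labeled "n", "n", "y", "y", the threshold 55 yields two pure branches and a weighted Gini index of 0.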
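3. C4.5
C4.5 is an extension of the ID3 algorithm and is referred to
as a statistical classifier. The gain ratio measure is used
for feature selection and to construct the decision trees.
Both continuous and discrete variables can be handled by C4.5.
Classification can be quick and highly accurate.
$$\mathrm{Gain\ Ratio}(A, S) = \frac{\mathrm{Information\ Gain}(S, A)}{\mathrm{Entropy}(S, A)} \qquad (5)$$
Here Entropy(S, A) is the entropy of the partition of S
induced by the values of A (the split information).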
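The gain ratio can be sketched in the same style. The code below assumes categorical feature values and interprets the denominator as the entropy of the partition induced by the feature (the split information); returning 0 when that entropy is zero is a convention chosen here to avoid division by zero.

```python
import math
from collections import Counter


def entropy(items):
    """Entropy of the empirical distribution of `items`."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of the feature divided by its split information."""
    n = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(s) / n * entropy(s)
                                 for s in subsets.values())
    split_info = entropy(values)  # entropy of the partition induced by the feature
    return gain / split_info if split_info > 0 else 0.0
```

A feature with a distinct value for every record, such as a patient identifier, achieves the maximum information gain but also has high split information, so its gain ratio is lower than that of a two-valued feature that separates the classes equally well.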
D. Support Vector Machine
Support vector machines[20] are supervised learning models
with associated learning algorithms that analyze data for
classification and regression analysis. A support vector
machine trains a classifier to predict the class of a new
record, so it is called a training algorithm. There are two
key elements in the implementation of SVM: mathematical
programming and kernel functions.
Limitations
1. The method is very sensitive to the local structure of the
data and hence requires large storage.
2. When the sample size is large, the computational cost is
high.
Application
1. Images can be classified using SVM.
2. The SVM algorithm has been widely applied in the biological
sciences. SVMs have been used to classify proteins, with up
to 90% of the compounds classified correctly.
V. PROPOSED METHODOLOGY
We plan to develop a system that combines data mining and soft
computing techniques to extract patterns that help in
predicting diseases accurately and efficiently. Such a system
is advantageous in the medical field because it helps both
doctors and patients to diagnose a disease early and to start
a new treatment, or continue an existing one, in the best
possible way with a thorough understanding of the particular
disease. The combination of data mining methods and soft
computing tools can markedly increase the efficiency of mining
quality patterns.
CONCLUSION
In this paper we have made an attempt to review the valuable
work of different researchers in the field of medical data
mining. We have summarized various approaches and algorithms
applied in medical data mining that would be helpful for
diagnosing diseases. However, the selection of a data mining
approach depends mainly on the type of data set. This survey
highlights the importance of research in the field of
diagnosing life-threatening diseases. The accuracy of disease
prediction should ideally be one hundred percent, since a
wrong prediction can have an adverse effect on the patient.
There is a strong need for a system that reduces the false
alarm rate, which would help in the early diagnosis of
disease. The combination of data mining methods and soft
computing tools such as artificial neural networks, genetic
algorithms and fuzzy logic can greatly improve the efficiency
of mining quality patterns. The use of soft computing tools in
data mining is a rising field of research, especially with the
ready availability of rich data sets.
References
1. M.A. Jabbar, B.L Deekshantulu, Priti Chandra, “Computational Intelligence Technique for Early Diagnosis of Heart Failure”, IEEE International conference on Engineering and Technology, 978-1-47991854-6/15/2015 IEEE.
2. Manmuna Fathima, Iqra Basharat & Dr Shoab Ahmed khan, Ali Raja Anjum, “Biomedical (Cardiac) Data Mining: Extraction of Significant Patterns for Predicting Heart Condition”.
3. Sellappan Palaniappan, Rafiah Awang, “Intelligent Heart Disease Prediction System using Data Mining Techniques”, 978-1-4244-1968-5/8/2008 IEEE.
4. Monika Gandhi, Dr Shailendra Narayan Singh, “Prediction in Heart Disease Using Techniques of Data Mining”, First international conference on Futuristic trend in Computational analysis and Knowledge Management, 978-1-4799-8433-6/15/2015 IEEE.
5. Sivagowry .S, Dr Durairaj.M and Persia.A, “An Empirical Study on Applying Data Mining Techniques for the Analysis and Prediction of Heart Disease”.
6. T.John Peter, K.Somasundaram, “An Empirical Study on Prediction of Heart Disease using Classification Data Mining Techniques”, International Conference on advances in Engineering, Science and Management, ISBN:978-81-909042-2-3/2012 IEEE.
7. Lamia AbedNoor Muhammed, “Using Data Mining Techniques to Diagnosis Heart Disease”.
8. Hanen Bouali, Jalel Akaichi, “Comparative Study of Different Classification Techniques”, 13th International conference on Machine Learning and Applications, 9781-4799-7415-3/14/2014 IEEE.
9. Yanwei Xing, Jie Wang and Zhihong Zhao, Yonghong Gao, “Combination Data Mining Methods with New Medical Data to Predicting Outcome of Coronary Heart Disease”, International conference on convergence Information Technology, 0-7695-3038-9/07/2007 IEEE.
10. Kiyana Zolfaghar, Naren Meadem, Ankurteredesai, Senjti Basu Roy, Si-Chi Chin, Brian Muckian, “Big Data Solutions for Predicting Risk-of-Readmission for Congestive Heart Failure Patients”, 2013 IEEE conference on Big Data, 978-1-4799-1293-3/13/2013 IEEE.
11. Hezlin Aryani Abd Rahman, Yap BeeWah, Zuraida Khairu Khairudin, Dr.Nik Nairan Abdullah, “Comparison of Predictive Models to Predict Survival of Cardiac Surgery Patients”, sponsored by fundamental Research Grant Scheme, Ministry of Higher Education, Malaysia.
12. Seema Sharma, Jitendra Agarwal, Shika Agarwal, Sanjeev Sharma, “Machine Learning Techniques for Data Mining: A Survey”, 978-1-4799-1597-2/13/2013 IEEE.
13. N.Poolsawad, L.Moore, C.kambhampati, J.G.F.Cleland, “Handling Missing Values in Data Mining-A case study of Heart Failure Data Set”, Ninth international conference on Fuzzy Systems and Knowledge Discovery, 978-14673-0024-7/10/2012 IEEE.
14. Renu Chauhan, Pinki Bajaj, Kavita Choudhary, Yogitha Gigras, “Frame work to predict Health Disease using Attribute selection mechanism”, 978-9-3805-44168/15/2015 IEEE.
15. Camargo M, Jimenez D, Gallego L, “Using of Data Mining and Soft Computing techniques for modeling bidding prices in power markets”, 978-1-4244-5098-52009 IEEE.
16. Ranganatha S., Pooja Raj H.R., Anusha C., Vinay S.K., “Medical Data Mining and Analysis for heart Disease Dataset Using Classification Techniques”.
17. Purushottam, Dr.Kanak Saxena, Richa Sharma, “Efficient Heart Disease Prediction System using Decision Tree”, 978-1-4799-8890-7/15/2015 IEEE.
18. B.V Ravindra, N.Sriraam, “Discovery of Significant parameters in Kidney Dialysis Data sets by K-Means Algorithm”, International Conference on Circuits, Communication, control and Computing-2014.
19. V.Krishnaiah, M.Srinivas, Dr G.Narsimha, Dr. N Subhash Chandra, “Diagnosis of Heart Disease Patients Using Fuzzy Classification Technique”.
20. Eman AbuKhousa, Pies Campbell, “Predictive Data Mining to Support Clinical Decisions: An Overview of Heart Disease Prediction systems”, 978-1-4673-11014/12/2012 IEEE.
21. Ankitha Dewan, Meghana Sharma, “Prediction of Heart Disease using a Hybrid Technique in Data Mining Classification”, 978-9-3805-4416-8/15/2015 IEEE.
22. M.Akhil Jabbar, Priti Chandra, B.L. Deekshatulu, “Prediction of Risk Score for Heart Disease using Associative Classification and Hybrid Feature Subset Selection”, 978-1-4673-5/12/2012 IEEE.
23. A.J.Alijaaf, D.AI-Jumeily, A.J.Huassain, T.Dawson, P.Fergus and AI-Jumaily, “Predicting the Likelihood of Heart Failure with a Multi level Risk Assessment using Decision Tree”, 978-1-4799-5680-7/15/2015 IEEE.
24. Hussah A.Al-Odan, Ahmad A. and Al-Daraiseh King Saud, “Open Source Data Mining Tools”, First international conference on electrical and information technologies – ICEIT2015, 978-1-4799-7479-5/15/2015 IEEE.
25. M.A.Nishara Banu and B.Gomathy, “Disease Forecasting System using Data Mining Methods”, 2014 International conference on intelligent computing applications, 978-14799-3966-4/2014 IEEE.
26. K.Srinivas, Dr.G.Raghavendra Rao and Dr.A.Govardhan, “Analysis of Coronary Heart Disease and Prediction of Heart Attack in Coal Mining Regions using Data Mining Techniques”, The Fifth international conference on computer science and education, Hefei, China, 978-14244-6005-2/10/2010 IEEE.
27. M.Akhil Jabbar, Dr B.L Deekshatulu, and Dr.Priti Chandra, “Heart Disease Prediction using Lazy Associative Classification”, 978-1-4673-5090-7/13/2013 IEEE.
28. Kanhaiya Lal and N.C.Mahanti, “Role of Soft Computing as a Tool in DataMining”, International Journal of Computer Science and Information Technologies, Vol. 2 (1), 2011.
29. P.K.Vaishali and Dr.A.Vinayababu, “Application of Data Mining and Soft Computing in Bioinformatics”, International Journal of Engineering Research and Applications.