
International Journal of Trend in Research and Development, Volume 3(6), ISSN: 2394-9333
www.ijtrd.com

Application of Data Mining and Soft Computing Techniques for Intelligent Medical Data Analysis

1 Vinutha M.R. and 2 Dr. Chandrika J.
1 Assistant Professor, 2 Professor
1 Information Science and Engineering, Malnad College of Engineering, Hassan, Karnataka, India
2 Computer Science and Engineering, Malnad College of Engineering, Hassan, Karnataka, India

Abstract: Data mining plays a crucial role in the field of medicine. Medical data is vast and rich in knowledge, yet much of it goes to waste when we fail to extract useful information from it. Extracting effective information from this wealth of medical data and making sound decisions for predicting diseases is becoming increasingly necessary. Data mining techniques can be used to mine massive medical data sets and to learn the hidden patterns and relationships that help in making effective decisions. This paper sheds light on the different methods of knowledge extraction, based on data mining and soft computing techniques, that are used in current research on disease prediction.

Keywords: Data Mining, Classification, Clustering, Naive Bayes, Decision Trees, Genetic Algorithm, Fuzzy Logic, Soft Computing.

I. INTRODUCTION

Medical diagnosis is extremely important; it has to be carried out with great care and performed conclusively. Data mining is the process of finding patterns in very large data sets. In the field of medicine, data mining is used for prognosis, diagnosis, decision making and treatment planning. So far, researchers have applied data mining and soft computing techniques independently to understand diseases and the factors influencing them, and to predict survival from diseases such as cancer, thyroid disorders, diabetes, liver disorders and hepatitis.
Extracting useful information from the wealth of medical data and making sound decisions for predicting diseases is becoming increasingly necessary. A poor clinical decision can put an individual's life at risk, so there should be no compromise in decisions concerning a patient's health. Healthcare data should be mined to discover the patterns that play a significant role in making crucial medical decisions. There is a clear need for a system that discovers the hidden patterns, relationships and trends in such data.

The rest of the paper is organized as follows. Section two surveys work in the literature on diagnosing diseases using data mining and, to some extent, soft computing. Sections three and four give a brief introduction to soft computing and data mining. Section five presents the proposed methodology, and the conclusion is given in section six.

II. RELATED WORK

M.A. Jabbar et al. [1] developed a computational approach for early diagnosis of heart disease. They proposed a method to enhance naive Bayes, using discretization and feature subset selection measures such as chi-square, gain ratio, One-R and genetic search. They concluded that One-R with genetic search for naive Bayes is best suited to early diagnosis of heart disease.

IJTRD | Nov-Dec 2016 Available [email protected]

Mamuna Fatima et al. [2] applied k-means [18][25] clustering to a preprocessed data set in order to extract useful features and patterns that help in effective prediction of heart disease. The attributes extracted by applying clustering algorithms to 1500 records include age, gender, known disease, heart rate, blood pressure, peak exercise and risk factor.
They compared k-means with other clustering algorithms and reported that k-means performs similarly to k-means fast and x-means, but better than k-medoids and DBSCAN.

Sellappan Palaniappan et al. [3] developed a prototype heart disease prediction system using three data mining classification techniques: naive Bayes, neural networks and decision trees. The system extracts hidden knowledge from a heart disease database using fifteen features. Based on the outcome of various tests, the researchers reported that the most effective model for predicting heart disease is naive Bayes, followed by neural networks and decision trees.

Sivagowry S. et al. [5] conducted an empirical study on applying data mining techniques to the analysis and prediction of heart disease. The existing literature shows that classification plays a more important role in heart disease prediction than clustering, association rules and regression. They chose 14 attributes, including age, sex, chest pain type, resting blood sugar, cholesterol and resting electrocardiographic results, used the Tanagra tool to classify the data, and evaluated the results using 10-fold cross validation.

Hezlin Aryani Abd Rahman et al. [11] developed a neural network model, decision trees and logistic regression to predict the survival status of cardiac surgery patients. The data consisted of 5154 observations with 23 variables; after data cleaning, a total of 4976 cases and 12 variables were used for analysis. The three predictive models were compared using classification accuracy rate, sensitivity and specificity. After comparing the three models, the researchers claimed that the neural network model is the best for predicting the survival status of cardiac surgery patients.
The neural network analysis showed that the important predictive variables include chest reopen, age, surgery type, gender, reoperation status and wound infection.

N. Poolsawad et al. [13] investigated the characteristics of a clinical data set using feature selection and classification techniques to deal with missing values and to develop a method for quantifying its numerous complexities. They also gave a comprehensive evaluation of a set of diverse machine learning schemes for clinical decisions. The data set used in the study is a large cardiological database called LIFELAB, a prospective study consisting of 463 variables.

T. John Peter et al. [6] proposed the use of pattern recognition and data mining techniques in prediction models for the cardiovascular clinical domain. The data were held in the Attribute-Relation File Format, an ASCII text file that describes a list of instances sharing a set of attributes. The researchers investigated the performance of different classification algorithms on 270 medical records. After evaluating all the classification algorithms, they concluded that naive Bayes gives better accuracy for heart disease prediction than the other classifiers.

Purushottam et al. [17] used Knowledge Extraction based on Evolutionary Learning (KEEL), an open source Java software tool for assessing evolutionary algorithms on data mining problems. They used the Cleveland database with 303 instances; the database contains a total of 76 raw attributes, but only 14 of them are actually used in the experiment. The data set is divided into two parts, with 151 instances in part 1 and 152 instances in part 2. The overall success rate achieved was 0.8675, with 0.863 in part 1 and 0.872 in part 2.

V. Krishnaiah et al. [19] developed a fuzzy approach for diagnosing heart disease. To remove the uncertainty in unstructured data, the researchers used a fuzzy k-NN approach, introducing an exponential membership function based on the mean and standard deviation calculated for each measured attribute. The data set consists of 1200 records with 13 attributes, divided into 25 classes of 48 records each. To assess the correctness of the system, the data set was divided into training and testing sets of equal size. Their work shows that an interval approach that treats the data as symbolic data improves the accuracy of the system.

Ankit Dewan et al. [21] applied machine learning techniques such as artificial neural networks with back-propagation, and a genetic algorithm for optimization. Because back-propagation can get stuck in local minima, the researchers were not able to achieve the best result with it alone, so they employed a genetic algorithm, which applies mutation and crossover over successive generations. They concluded that the neural network is the best of the classification techniques considered, especially for prediction or classification of nonlinear data.

K. Srinivas et al. [26] focused on using different algorithms for predicting combinations of several target attributes. They presented an automated and effective heart attack prediction system using data mining techniques. Their approach extracts significant patterns from heart disease data warehouses: based on a calculated significance weightage, frequent patterns with a value greater than a predefined threshold are chosen for predicting heart attack.
Fifteen attributes are listed in the medical literature as significant for predicting heart attack. They proposed including the additional attributes stress, pollution, previous medical history and financial status. The researchers also suggested that data mining techniques such as time series analysis, clustering and association can be used to analyze patients' behavior. The neural network used in this study has a multilayered feed-forward architecture, and a stochastic back-propagation algorithm is used to construct the fuzzy-based neural network.

III. SOFT COMPUTING

Soft computing is an approach to building computationally intelligent systems. It is typically applied to problems for which no algorithm is known that computes an exact solution in polynomial time. Soft computing tolerates imprecision and partial truth, and often works well with approximate solutions. It includes techniques such as fuzzy logic [15], neural networks [4], artificial intelligence [29], genetic algorithms [28] and machine learning.

A. Fuzzy Logic

Fuzzy logic provides a powerful way to categorize a concept in an abstract way by introducing vagueness. It is a form of many-valued logic in which the truth value of a variable may be any real number between 0 and 1. It is used to handle the concept of partial truth, where the truth value may range between completely true and completely false. The output of a fuzzy system is a consensus of all the inputs and all the rules, so a fuzzy logic system can be well behaved even when input values are not completely available or trustworthy.

The steps in a fuzzy logic process are:

1. Fuzzify all input values into fuzzy membership functions.
2. Execute all applicable rules in the rule base to compute the fuzzy output functions.
3. De-fuzzify the fuzzy output functions to get "crisp" output values.

Limitations

1. Fuzzy logic systems use approximation, so they are not a good choice for managing systems that require extreme precision.
2. Fuzzy logic systems are expensive to develop, as they often require extensive testing.

B. Genetic Algorithm

A genetic algorithm is a powerful tool for optimization. A typical genetic algorithm [28] requires a genetic representation of the solution domain and a fitness function to evaluate candidate solutions. It has three fundamental operators: selection, which equates to survival of the fittest; crossover, which represents mating between individuals; and mutation, which introduces random modifications. Using the selection operator alone tends to fill the population with copies of the best individual. Using selection and crossover tends to make the algorithm converge on a good but sub-optimal solution. Using mutation alone induces a random walk through the search space. Using selection and mutation together creates a parallel, noise-tolerant, hill-climbing algorithm.

IV. DATA MINING

Data mining [5] is the process of extracting hidden and useful patterns from very large data sets. The three primary data mining techniques are classification [10], in which instances are grouped into predefined classes, clustering and regression. Classification is a form of data analysis that extracts models describing important classes. Such models are called classifiers, and they predict class labels. Classification [16] is used to study and examine existing data and to predict future behavior, which supports intelligent decision making.
Classification is a two-step process. The first step is the learning phase, in which the training data set is analyzed and rules and patterns are extracted. The second step is the classification step, in which the model is applied to test data and the accuracy of the classification patterns is recorded. Classification methods include decision tree induction [14][16][23], rule-based classifiers, Bayesian classifiers [8], nearest neighbor classifiers, artificial neural networks [9][10] and support vector machines [12][20].

Regression is a data mining task that predicts a number. A regression task begins with a data set in which the target values are known. While building a model, a regression algorithm estimates the value of the target as a function of the predictors for each case in the training data. The relationships between the predictors and the target are summarized in a model, which can then be applied to a different data set in which the target values are unknown.

A. Bayesian Network

A Bayesian network is a graphical model used for identifying the relationships among a set of features. It is a directed acyclic graph in which the nodes are in one-to-one correspondence with the features of the data set. Bayesian classifiers have shown high accuracy and speed when applied to large databases. The naive Bayes classifier is based on Bayes' theorem:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \tag{1}$$

where
X is the evidence, described by measurements on a set of attributes;
H is the hypothesis that the data tuple X belongs to a specified class c;
P(H|X) is the posterior probability that the hypothesis H holds given X;
P(H) is the prior probability of H, independent of X;
P(X|H) is the posterior probability of X conditioned on H.

Advantages

1. Capable of handling noisy data and able to classify patterns on which it has not been trained.
2. Training and classification are fast.
3. Less sensitive to irrelevant features.

Limitations

1. Poor interpretability.
2. Assumes independence of features.

B. K-Nearest Neighbor

K-nearest neighbor is among the simplest methods: an object is classified based on the closest training examples in the feature space. It computes the decision boundary implicitly, although the boundary can also be computed explicitly, in which case the computational complexity is a function of the boundary complexity. No explicit training step is required. The neighbors are taken from a set of objects for which the correct classification is known. The best choice of k depends on the data set: a higher value of k reduces the effect of noise on the classification, whereas with a small value of k noisy samples may become prominent and cause misclassification errors.

Advantages

1. Easy to understand and implement.

C. Decision Tree

In the decision tree classification technique [14][16], classification is based on splitting criteria. A decision tree is a flowchart-like structure in which instances are classified by sorting them according to their feature values. Each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. The tree is constructed greedily in a top-down fashion: starting with the training set, the algorithm recursively finds a split feature by maximizing some local criterion. Measures such as the Gini index, information gain and gain ratio can be used to find the feature that best divides the training data. Algorithms such as ID3, C4.5 and CART are widely used.

1. ID3

ID3 stands for Iterative Dichotomiser 3. It constructs the decision tree using a top-down greedy approach, and it selects the best attribute using information gain. To find the information gain, the entropy must first be calculated:

$$\mathrm{Entropy}(S) = -\sum_{I \in C} P(I)\,\log_2 P(I) \tag{2}$$

where S is the set of all records, P(I) is the proportion of S belonging to class I, C is the set of classes, and the summation runs over all classes in C.
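As an illustration only, the entropy in equation (2), and the information gain that ID3 derives from it as the drop in entropy after splitting on a feature, can be sketched in Python. The function names, the dictionary-based record layout and the toy records below are assumptions made for this sketch; they do not come from any of the surveyed systems.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels, as in equation (2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # P(I) is the fraction of records that carry class label I.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, feature):
    """Entropy of the whole set minus the size-weighted entropy of each subset."""
    total = len(labels)
    subsets = {}
    # Group the class labels by the value the chosen feature takes in each row.
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Toy records invented for illustration; "chest_pain" is a hypothetical feature.
    rows = [
        {"chest_pain": "typical"},
        {"chest_pain": "typical"},
        {"chest_pain": "atypical"},
        {"chest_pain": "atypical"},
    ]
    labels = ["disease", "disease", "healthy", "healthy"]
    print(entropy(labels))
    print(information_gain(rows, labels, "chest_pain"))
```

A feature that separates the classes perfectly leaves every subset with zero entropy, so its gain equals the entropy of the whole set; ID3 picks the feature with the largest gain at each node.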
$$\mathrm{Information\ Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in V} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{3}$$

where A is the feature for which the gain is calculated, V is the set of all possible values of A, and S_v is the subset of S for which A takes the value v, with |S_v| its number of elements.

2. CART

CART is one of the most important tools in data mining. CART stands for Classification and Regression Trees. It uses binary splitting, so each internal node has exactly two outgoing edges, and splits are chosen using the Gini index:

$$\mathrm{Gini\ index} = 1 - \sum_{I} P(I)^2 \tag{4}$$

CART has long been regarded as a fast and capable predictive modeling algorithm for analysts.

3. C4.5

C4.5 is an extension of the ID3 algorithm and is often referred to as a statistical classifier. It uses the gain ratio measure for feature selection when constructing decision trees. C4.5 can handle both continuous and discrete variables, and classification can be quick and highly accurate.

$$\mathrm{Gain\ Ratio}(A, S) = \frac{\mathrm{Information\ Gain}(S, A)}{\mathrm{Entropy}(S, A)} \tag{5}$$

Here Entropy(S, A) is taken to mean the entropy of the partition of S induced by A, usually called the split information.

D. Support Vector Machine

A support vector machine [20] trains a classifier to predict the class of a new record, so it is referred to as a training algorithm. Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. There are two key elements in implementing an SVM: mathematical programming and kernel functions.

Limitations

1. It is very sensitive to the local structure of the data and hence requires large storage.
2. When the sample size is large, computational costs are high.

Applications

1. Images can be classified using SVM.
2. The SVM algorithm has been widely applied in the biological sciences. It has been used to classify proteins, with up to 90% of the compounds classified correctly.

V. PROPOSED METHODOLOGY

We plan to develop a system that combines data mining and soft computing techniques to extract patterns that help in predicting diseases accurately and efficiently. Such a system is valuable in the medical field because it helps both doctors and patients to diagnose a disease early and to start a new treatment, or continue an existing one, in the best possible way with a thorough understanding of the disease. The combination of data mining methods and soft computing tools can substantially increase the efficiency of mining quality patterns.

CONCLUSION

In this paper we have reviewed the work of different researchers in the field of medical data mining. We have summarized various approaches and algorithms applied in medical data mining that would be helpful for diagnosing diseases. The choice of data mining approach depends mainly on the type of data set. This survey highlights the importance of research on diagnosing life-threatening diseases. Ideally, the accuracy of disease prediction should be one hundred percent, since a wrong prediction adversely affects the patient. There is a strong need for a system that reduces the false alarm rate, which would help in early diagnosis of disease. The combination of data mining methods and soft computing tools such as artificial neural networks, genetic algorithms and fuzzy logic can greatly improve the efficiency of mining quality patterns. The use of soft computing tools in data mining is a growing field of research, especially given the ready availability of rich data sets.

References

1. M.A. Jabbar, B.L. Deekshatulu, Priti Chandra, "Computational Intelligence Technique for Early Diagnosis of Heart Failure", IEEE International Conference on Engineering and Technology, 978-1-47991854-6/15/2015 IEEE.
2. Manmuna Fathima, Iqra Basharat, Dr Shoab Ahmed Khan, Ali Raja Anjum, "Biomedical (Cardiac) Data Mining: Extraction of Significant Patterns for Predicting Heart Condition".
3. Sellappan Palaniappan, Rafiah Awang, "Intelligent Heart Disease Prediction System using Data Mining Techniques", 978-1-4244-1968-5/8/2008 IEEE.
4. Monika Gandhi, Dr Shailendra Narayan Singh, "Prediction in Heart Disease Using Techniques of Data Mining", First International Conference on Futuristic Trend in Computational Analysis and Knowledge Management, 978-1-4799-8433-6/15/2015 IEEE.
5. Sivagowry S., Dr Durairaj M. and Persia A., "An Empirical Study on Applying Data Mining Techniques for the Analysis and Prediction of Heart Disease".
6. T. John Peter, K. Somasundaram, "An Empirical Study on Prediction of Heart Disease using Classification Data Mining Techniques", International Conference on Advances in Engineering, Science and Management, ISBN: 978-81-909042-2-3/2012 IEEE.
7. Lamia AbedNoor Muhammed, "Using Data Mining Techniques to Diagnosis Heart Disease".
8. Hanen Bouali, Jalel Akaichi, "Comparative Study of Different Classification Techniques", 13th International Conference on Machine Learning and Applications, 9781-4799-7415-3/14/2014 IEEE.
9. Yanwei Xing, Jie Wang, Zhihong Zhao, Yonghong Gao, "Combination Data Mining Methods with New Medical Data to Predicting Outcome of Coronary Heart Disease", International Conference on Convergence Information Technology, 0-7695-3038-9/07/2007 IEEE.
10. Kiyana Zolfaghar, Naren Meadem, Ankur Teredesai, Senjti Basu Roy, Si-Chi Chin, Brian Muckian, "Big Data Solutions for Predicting Risk-of-Readmission for Congestive Heart Failure Patients", 2013 IEEE Conference on Big Data, 978-1-4799-1293-3/13/2013 IEEE.
11. Hezlin Aryani Abd Rahman, Yap BeeWah, Zuraida Khairu Khairudin, Dr. Nik Nairan Abdullah, "Comparison of Predictive Models to Predict Survival of Cardiac Surgery Patients", sponsored by Fundamental Research Grant Scheme, Ministry of Higher Education, Malaysia.
12. Seema Sharma, Jitendra Agarwal, Shika Agarwal, Sanjeev Sharma, "Machine Learning Techniques for Data Mining: A Survey", 978-1-4799-1597-2/13/2013 IEEE.
13. N. Poolsawad, L. Moore, C. Kambhampati, J.G.F. Cleland, "Handling Missing Values in Data Mining - A Case Study of Heart Failure Data Set", Ninth International Conference on Fuzzy Systems and Knowledge Discovery, 978-14673-0024-7/10/2012 IEEE.
14. Renu Chauhan, Pinki Bajaj, Kavita Choudhary, Yogitha Gigras, "Frame work to predict Health Disease using Attribute selection mechanism", 978-9-3805-44168/15/2015 IEEE.
15. Camargo M, Jimenez D, Gallego L, "Using of Data Mining and Soft Computing techniques for modeling bidding prices in power markets", 978-1-4244-5098-52009 IEEE.
16. Ranganatha S., Pooja Raj H.R., Anusha C., Vinay S.K., "Medical Data Mining and Analysis for Heart Disease Dataset Using Classification Techniques".
17. Purushottam, Dr. Kanak Saxena, Richa Sharma, "Efficient Heart Disease Prediction System using Decision Tree", 978-1-4799-8890-7/15/2015 IEEE.
18. B.V. Ravindra, N. Sriraam, "Discovery of Significant Parameters in Kidney Dialysis Data Sets by K-Means Algorithm", International Conference on Circuits, Communication, Control and Computing, 2014.
19. V. Krishnaiah, M. Srinivas, Dr G. Narsimha, Dr. N. Subhash Chandra, "Diagnosis of Heart Disease Patients Using Fuzzy Classification Technique".
20. Eman AbuKhousa, Pies Campbell, "Predictive Data Mining to Support Clinical Decisions: An Overview of Heart Disease Prediction Systems", 978-1-4673-11014/12/2012 IEEE.
21. Ankitha Dewan, Meghana Sharma, "Prediction of Heart Disease using a Hybrid Technique in Data Mining Classification", 978-9-3805-4416-8/15/2015 IEEE.
22. M. Akhil Jabbar, Priti Chandra, B.L. Deekshatulu, "Prediction of Risk Score for Heart Disease using Associative Classification and Hybrid Feature Subset Selection", 978-1-4673-5/12/2012 IEEE.
23. A.J.Alijaaf, D.AI-Jumeily, A.J.Huassain, T.Dawson, P.Fergus and AI-Jumaily, "Predicting the Likelihood of Heart Failure with a Multi level Risk Assessment using Decision Tree", 978-1-4799-5680-7/15/2015 IEEE.
24. Hussah A.Al-Odan, Ahmad A. and Al-Daraiseh King Saud, "Open Source Data Mining Tools", First International Conference on Electrical and Information Technologies, ICEIT 2015, 978-1-4799-7479-5/15/2015 IEEE.
25. M.A. Nishara Banu and B. Gomathy, "Disease Forecasting System using Data Mining Methods", 2014 International Conference on Intelligent Computing Applications, 978-14799-3966-4/2014 IEEE.
26. K. Srinivas, Dr. G. Raghavendra Rao and Dr. A. Govardhan, "Analysis of Coronary Heart Disease and Prediction of Heart Attack in Coal Mining Regions using Data Mining Techniques", The Fifth International Conference on Computer Science and Education, Hefei, China, 978-14244-6005-2/10/2010 IEEE.
27. M. Akhil Jabbar, Dr B.L. Deekshatulu and Dr. Priti Chandra, "Heart Disease Prediction using Lazy Associative Classification", 978-1-4673-5090-7/13/2013 IEEE.
28. Kanhaiya Lal and N.C. Mahanti, "Role of Soft Computing as a Tool in Data Mining", International Journal of Computer Science and Information Technologies, Vol. 2 (1), 2011.
29. P.K. Vaishali and Dr. A. Vinayababu, "Application of Data Mining and Soft Computing in Bioinformatics", International Journal of Engineering Research and Applications.
