Survey
Mining of Classification Patterns in Clinical Data through Data Mining Algorithms

Shomona Gracia Jacob
Ph.D Research Scholar
Rajalakshmi Engineering College
Thandalam, Chennai, India
91-044-26261340, 91-9841242291

R. Geetha Ramani
Professor & Head, Dept. of CSE
Rajalakshmi Engineering College
Thandalam, Chennai, India
91-9442891948

Categories and Subject Descriptors
I.5 [Pattern Recognition]: Design Methodology - classifier design and evaluation.

ABSTRACT
Data mining on clinical data is a challenging area in the field of medical research, aiming at predicting and discovering patterns of disease occurrence and prognosis based on detected symptoms and reported health conditions. Data mining is the process of recovering related, significant and imperative information from a copious collection of comprehensive data. The main focus of this paper is to highlight the significance of machine learning and data mining techniques in classifying clinical datasets downloaded from the University of California, Irvine (UCI) Machine Learning Repository, viz., Mammography masses, Orthopaedic (Vertebral Column) ailments, Dermatology infection, SPECTF (Single Photon Emission Computed Tomography) Heart and Thyroid diseases. We have made a careful selection of clinical data from various domains in order to identify the performance of the data mining algorithms on different types of clinical datasets. Our research work incorporates the design of a data mining framework that generates an efficient classifier trained on all the clinical datasets stated, by learning patterns and rules framed by executing classification algorithms.
This enables the formulation of precise and accurate decisions to classify unseen medical test data in the related field. Our results validate and confirm that Quinlan's C4.5 classification algorithm and the Random Tree algorithm yield 100 percent classification accuracy on the SPECTF Heart, Orthopaedic (Vertebral Column) ailments, Thyroid and Dermatology infection datasets, while Binary Logistic Regression and CS-MC4 also give 100 percent classifier accuracy on the SPECTF Heart dataset and Multinomial Logistic Regression classifies the Dermatology dataset with 100 percent accuracy. However, on the Mammography training dataset, the classification accuracy produced by Random Tree and Quinlan's C4.5 falls short of 100 percent. We modify the parameters that control the decision tree size and the confidence level to achieve 100 percent classifier accuracy. The decision tree generated by Quinlan's C4.5 algorithm is smaller than the decision tree given by the Random Tree classification technique. Our research states that Quinlan's C4.5 algorithm is the most efficient in building a classifier trained on clinical datasets from diverse domains that could provide 100 percent classifier accuracy on previously unseen test data.

General Terms
Algorithms, Performance, Design

Keywords
Data Mining, Machine Learning, Classification, Clinical data

1. INTRODUCTION
Data mining (Eugenia, 2008) is a process of retrieving consequential and imperative information from an exhaustive collection of data. Data mining tools (Witten, 2011a, 2011b) predict future patterns and possible trends that enable users to make knowledge-based and knowledge-driven decisions. They search records for hidden information in order to provide extrapolative information that experts may otherwise overlook.
Machine intelligence (Michalski et al., 1983; Kotsiantis, 2007) is one of the core components of data mining. It involves the study and analysis of computer algorithms and techniques that are expected to improve automatically with experience. One of the key application areas in the field of machine learning (Mitchell, 1997) explores data mining programs that discover general rules in sizeable datasets. A major application and research area of machine learning is the improvement of classifier accuracy by learning. Classification (Iavindrasana, 2008) is a data mining technique derived from the concept of machine learning. In classification, the goal is to find a model for the class or target attribute as a function of the other predictor attributes. In this research work we have analyzed the performance of classification algorithms in performing both binary and multi-class classification. In binary classification (Wu, 2008), the target class can have only two possible values, whereas in multi-class classification the target attribute can have multiple values.

Clinical data mining (Iavindrasana, 2009) is committed to retrieving novel and previously unknown information from medical records and databases. Classification (Ressom, 2008) is the most commonly applied data mining function on clinical datasets. Application of technology in the field of medicine is expected to advance the current state of disease prediction and prognosis. Early and precise prediction and classification of ailments is expected to aid researchers in designing drugs that will prevent and arrest the progress of life-threatening ailments and provide hope to a large section of the ailing population.
In this paper we bring together classification techniques to discover patterns in medical records and highlight the performance of the training models that will enable more precise prediction and classification of ailments. We apply nearly twenty classification algorithms, viz., Quinlan's C4.5, Random Tree, Binary Logistic Regression (BLR), Multinomial Logistic Regression (MLR), Partial Least Squares for Classification (C-PLS), Classification Tree (C-RT), Cost-Sensitive Classification Tree (CS-CRT), Cost-Sensitive Decision Tree algorithm (CS-MC4), SVM for classification (C-SVC), Iterative Dichotomiser 3 (ID3), K-Nearest Neighbor (K-NN), Linear Discriminant Analysis (LDA), Logistic Regression (Log-Reg), Multilayer Perceptron (MP), Naïve Bayes Classifier (NBC), Partial Least Squares - Discriminant/Linear Discriminant Analysis (PLS-DA/LDA), Prototype-Nearest Neighbor (P-NN), Radial Basis Function (RBF), and Support Vector Machine (SVM), to the training clinical datasets and evaluate the performance of the algorithms based on their error rates, accuracy and decision tree size.

1.1 Paper Organization
Section 2 briefly reviews the related work in the area of data mining, while Section 3 describes the proposed system. Section 4 substantiates our findings with the necessary results and Section 5 concludes the paper.

2. BACKGROUND OF THE WORK
Previous research findings in the area of clinical data mining and knowledge discovery are briefly reported in the following paragraphs.

Bloch et al. (2011) evaluated the performance of classifiers, viz., J48, Random Forest, Naïve Bayes, AdaBoost M1 and Bagging, on the Labor, Iris, Vertebral Column, Ionosphere, Dermatology and Car datasets. They reported 100% accuracy for the Random Forest classifier on all the datasets. Moreover, they evaluated classifier performance using ROC values and Kappa statistics in the Weka data mining tool. Soni et al. (March, 2011) performed an analysis of data mining techniques using Tanagra. However, they studied only the heart disease dataset and provided a survey of classifier performance with results from cross-validation. Their results do not report 100 percent accuracy.

Kusiak et al. (2000) discussed the problem of predicting outcomes in medical and engineering datasets. They discussed the performance of only two approaches, viz., rough set theory and decision tree generation. They report 100 percent prediction accuracy for the medical datasets but only 98.6% accuracy on the engineering data. Mullins et al. (2005) applied a new data mining technique named "HealthMiner" to a large cohort of 667,000 inpatient and outpatient records from an academic digital system. HealthMiner approaches knowledge discovery using three unsupervised rule discovery methods: CliniMiner, Predictive Analysis, and Pattern Discovery. They tabulated the results for data trend characterization, discovery of medically known and unknown correlations, and identification of data anomalies using all three unsupervised methods. Their results suggest that unsupervised data mining of large clinical repositories is feasible.

Prather et al. (1997) made a complete study of the data mining techniques involved in knowledge discovery from medical databases. However, their interests were focused mainly on data warehousing, data cleaning and data analysis. They collected around 3902 patient records relating to obstetrics, and this data was evaluated by exploratory factor analysis to identify factors contributing to preterm birth. Three factors were identified for further investigation. Nassif et al. (2009) described a concept information extraction technique, given a lexicon, and a BI-RADS feature extraction algorithm for clinical data mining on the mammography dataset alone. However, the number of records is limited to 100. They report that their algorithm achieves 97.7% precision, 95.5% recall and an F-score of 0.97. It is said to outperform manual feature extraction at the 5% statistical significance level and, in particular, achieves a high recall gain over manual indexing. The data mining framework proposed in our paper is portrayed in detail in the following section.

3. SYSTEM DESIGN
The data mining (Han and Kamber, 2000) framework designed to generate an efficient classifier is described in the following sub-sections. The diagrammatic representation of the data mining framework proposed in our paper is displayed in Figure 1.

Figure 1. Data Mining Framework for Design of Efficient Classifier

The proposed system comprises the Training Data Selection phase and Data Pre-processing, followed by the Classification phase and the Evaluation phase that selects the best classifier. After choosing the best classifier, the test data is loaded to verify the accuracy of the selected classifier.

International Conference on Advances in Computing, Communications and Informatics (ICACCI-2012)

Table 1. Description of Clinical Datasets
[Table 1 columns: No.; Domain; predictor attributes; instances; possible values. Legible entries: SPECT-Heart disease, 44 predictor attributes; Dermatology infection, 33 predictor attributes, 366 instances.]

3.1 Training Datasets
The medical datasets used to train the classifier cover a broad range of medical ailments, each dataset having different types of data that include numbers, text, and domains of values. Hence a classification algorithm that performs well on each of these datasets will aid in predicting disease occurrence patterns and provide scope for further discovery of unknown trends in the stored medical records. The datasets, their attributes, the number of training examples and the possible values are listed in Table 1. These clinical datasets were downloaded from the University of California, Irvine Machine Learning Repository (UCI, Irvine) to carry out this research work.

The Dermatology database is concerned with the differential diagnosis of erythemato-squamous diseases, which share the clinical features of erythema and scaling with very few differences. The diseases in this group are Psoriasis, Seborrheic Dermatitis, Lichen Planus, Pityriasis Rosea, Chronic Dermatitis, and Pityriasis Rubra Pilaris. The values of the histopathological features are determined by an analysis of the samples under a microscope. The Thyroid data comprises continuous and discrete valued attributes based on the Thyroxine levels and the hormones T3, T4 and TSH. The classification may be a negative case or a discordant case. The SPECTF-Heart dataset describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Each of the patients is classified into one of two categories: normal and abnormal. The database of 267 SPECT image sets (patients) was processed to extract 44 features that summarize the original SPECT images available at the public data repository, the UCI Irvine Machine Learning Repository. This dataset is named SPECTF in the repository to differentiate it from the SPECT data set, which is further processed to reduce the attributes to 22. The Mammography masses data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient's age. It contains a BI-RADS assessment, the patient's age and three BI-RADS attributes together with the severity field, collected on full field digital mammograms at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006. The BI-RADS assessment can be an indication of how well a CAD system performs compared to the radiologists. In the Orthopaedic (Vertebral Column) dataset each patient is represented by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine. The patients can be classified into normal and abnormal categories.

3.2 Data Pre-processing
The datasets are explored individually since each dataset targets a unique ailment, has a varying number of records, and has values that are discrete or continuous according to the nature of the attribute predicting the specific disease. Moreover, the target values of the class label vary according to the particular malady. Hence the datasets, available in text and .arff formats, are imported into an Excel spreadsheet with the column headers clearly indicating the predictor attributes and the specific target attribute. Missing values are appropriately replaced with related values. The Excel spreadsheet is then loaded into Tanagra, a data mining tool (Tanagra). The attributes are specified and the loaded data is visualized for verification. Once the data is found to be precisely recorded, we proceed with classification as explained in the following section.

3.3 Classification
The main objective of classification is to accurately predict the target class for each record. The training process of classification attempts to discover relationships between the predictor and the target values. Classification algorithms (Elter, 2007) differ in the techniques they use to determine these relationships. These relationships are further recapitulated in a model which is then applied to a record (test data) where the class label is unknown. The best performing classification algorithms in our survey on the datasets using the Tanagra data mining tool are briefly explained in the following sub-sections.

3.3.1 Quinlan's C4.5 Algorithm
C4.5 is an algorithm used to generate a decision tree and was developed by Ross Quinlan (Kohavi and Quinlan, 1999). The training data is a set S = {S1, S2, ...} of already classified samples. Each sample Si = (k1, k2, ...) is a vector where k1, k2, ... represent attributes or features of the sample. The training data is augmented with a vector C = (C1, C2, ...) where C1, C2, ... represent the class to which each sample belongs. At each node of the tree, C4.5 chooses the one attribute of the data that most effectively splits the set of samples into subsets enriched in one class or the other (Quinlan, 1986). Its criterion is the normalized information gain that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists. Sample rules generated by the C4.5 algorithm for the SPECTF heart dataset are given in Figure 2.

Figure 2. Sample Classification Rules generated by the C4.5 Algorithm on the SPECTF-Heart Dataset
[Legible rule fragments: tests on attributes F21R and F20S, ending in "F21R >= 77.5000 then DIAGNOSIS = ABNORMAL".]

3.3.2 Random Tree Algorithm
Random trees were introduced by Leo Breiman and Adele Cutler. Random trees are a collection (ensemble) of tree predictors that is called a forest. The classification works as follows: the random trees classifier takes the input feature vector, classifies it with every tree in the forest, and outputs the class label that received the majority of "votes" (Ressom, 2008). In the case of regression, the classifier response is the average of the responses over all the trees in the forest. In most machine learning algorithms, the best approximation to the target function is assumed to be the "simplest" classifier that fits the given data, since more complex models tend to overfit the training data and generalize poorly (Tanagra tutorials). Sample rules generated by the Random Tree algorithm for the SPECTF heart dataset are given in Figure 3.

[Figure 3: sample rules generated by the Random Tree algorithm; the legible fragments test attributes F22S, F20S and F19R.]

3.3.3 Multinomial Logistic Regression
Multinomial logistic regression (Greene, 2003; Hilbe, 2009) is a regression model which generalizes logistic regression by allowing more than two discrete outcomes. This model is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (Tanagra tutorials). The variables may be real-valued, binary-valued, categorical-valued, etc. The multinomial logistic model assumes the data to be case specific, that is, each independent variable has a single value for each case. The model also assumes that the dependent variable cannot be perfectly predicted from the independent variables for any case. Here collinearity (Agresti, 2007; Cios, 2001) is assumed to be relatively low, as it becomes difficult to differentiate between the impacts of several variables if they are highly correlated.

3.3.4 Cost Sensitive-Misclassification Cost Matrix (CS-MC4)
The cost-sensitive classification algorithm (Kotsiantis, 2007; Wu, 2008) is similar to C4.5 but uses m-estimate smoothed probability estimation, which is a generalization of the Laplace estimate. It minimizes the expected loss using the misclassification cost matrix to detect the best prediction within leaves (Tanagra tutorials). The target attribute must be discrete in nature while the predictors may be continuous or discrete valued.

3.4 Evaluation Phase
The best performing classifier is chosen based on the misclassification rates produced and the decision tree size generated by the respective algorithms. The details of the classifier results are outlined in Section 4. The Random Tree algorithm, C4.5 algorithm, CS-MC4, Binary Logistic Regression and Multinomial Logistic Regression algorithms proved to be the most accurate, as substantiated by the results obtained in our classifier performance analysis.

3.5 Test Phase
The test phase is necessary to corroborate the classifier accuracy. Consequently, we present the classifier with clinical test data, one record from each kind of ailment on which the classifier was trained, and validate the precision with which the classification is made. The test data was correctly classified for each category of ailment covered in this paper. The performance of the classifier is dealt with in the following section.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICACCI '12, August 03 - 05 2012, CHENNAI, India
Copyright 2012 ACM 978-1-4503-1196-...

4. PERFORMANCE EVALUATION
The classification techniques analyzed on the medical datasets are graded on performance measures that include error rate, accuracy and decision tree size. They are presented in the following sub-section.

4.1 Measures of Performance
The best performing classifier is chosen based on the following measures:

4.1.1 Accuracy
The accuracy (Han and Kamber, 2000) of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.

4.1.2 Error-Rate
Also called the misclassification rate, it is measured as 1 - Acc(M), where Acc(M) is the accuracy of the classifier M.

4.1.3 Decision Tree Size
The size of the decision tree (Zhu, 2007; Kohavi and Quinlan, 1999) is specified by the number of nodes. The tree that predicts the correct class label with the smallest number of nodes is taken to be the efficient classifier.

4.2 Experimental Results
The experimental results for the twenty classification algorithms on the medical datasets are outlined in this section. The Random Tree classification algorithm and Quinlan's C4.5 algorithm produce 100 percent accuracy on the SPECTF Heart, Orthopaedic ailment, Dermatology and Thyroid disease datasets. Binary Logistic Regression and CS-MC4 produce 100 percent accuracy on the SPECTF Heart dataset, and Multinomial Logistic Regression produces 100 percent accuracy on the Dermatology dataset. The comparative classifier accuracy on the SPECTF-Heart and Mammography Masses datasets is tabulated in Table 2.

Table 2. Comparative Classifier Performance on the SPECT-Heart and Mammography Masses Dataset
[Table 2 columns: S.No; Classification Algorithms; SPECTF Accuracy (%); Mammography Accuracy (%). Rows are numbered 1 to 17; the legible row is Random Tree: 100, 91.36.]

The graphical representation of the performance of the classification algorithms on the SPECTF-Heart and Mammography datasets is shown in Figure 4. The performance comparison is given on the datasets based on the algorithms that can be applied to them, since the type of attributes and the target values decide which classification algorithms can be executed.

Figure 4. Performance Comparison of Classification Algorithms on the SPECTF-Heart and Mammography Datasets

The precise values of accuracy produced by the classification algorithms on the Orthopaedic and Dermatology datasets are given in Table 3.

Table 3. Performance of Classification Algorithms on Orthopaedic Ailment and Dermatology Infection Datasets
[Table 3 columns: Algorithms; Orthopaedic Accuracy (%); Dermatology Accuracy (%). The legible row is Random Tree: 100, 100.]
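The measures defined in Section 4.1 can be illustrated with a short sketch. The Python code below is not the Tanagra procedure used in this work; it is a minimal illustration under stated assumptions. The function names (`accuracy`, `error_rate`, `select_classifier`), the class labels and the node counts are all hypothetical and chosen only for the example. It computes accuracy as the percentage of correctly classified tuples, the error rate as its complement (1 - Acc(M), here on a percentage scale), and then picks the classifier with the highest accuracy, breaking ties by the smallest number of tree nodes.

```python
from typing import Dict, Sequence, Tuple


def accuracy(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of tuples whose predicted class equals the actual class."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


def error_rate(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Misclassification rate, i.e. 1 - Acc(M), expressed as a percentage."""
    return 100.0 - accuracy(actual, predicted)


def select_classifier(results: Dict[str, Tuple[float, int]]) -> str:
    """Return the name with the highest accuracy; ties go to the fewest nodes.

    `results` maps a classifier name to (accuracy in percent, number of nodes).
    """
    return min(results, key=lambda name: (-results[name][0], results[name][1]))


if __name__ == "__main__":
    # Made-up labels for four test tuples (not taken from the datasets).
    actual = ["normal", "abnormal", "abnormal", "normal"]
    predicted = ["normal", "abnormal", "normal", "normal"]
    print(accuracy(actual, predicted))    # 75.0
    print(error_rate(actual, predicted))  # 25.0

    # Hypothetical (accuracy, node count) pairs for two equally accurate trees.
    candidates = {"tree_a": (100.0, 25), "tree_b": (100.0, 10)}
    print(select_classifier(candidates))  # tree_b
```

Sorting on the pair (negative accuracy, node count) keeps the selection rule in one place: accuracy is compared first, and tree size only matters when two classifiers are equally accurate, which mirrors the way the smallest tree is preferred among classifiers that reach the same accuracy.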
The thyroid disease dataset is composed of predictor attributes that are both continuous and discrete-valued. Hence only six classification algorithms could be executed on the dataset. The parameters referring to the number of attributes selected for a split in the Random Tree algorithm are set according to the number of attributes to ensure precise classification on all five clinical datasets. The accuracy of sixteen classification algorithms on the Orthopaedic ailment and Dermatology infection datasets is graphically displayed in Figure 5.

Figure 5. Graphical Representation of Classifier Performance on Orthopaedic and Dermatology Infection Dataset

The graphical representation of classification algorithms on the Thyroid dataset is portrayed in Figure 6. The comparative classifier accuracy on the Thyroid dataset is displayed in tabular form in Table 4.

Table 4. Performance of Classification Algorithms on Thyroid Disease Dataset
[Table 4 columns: S.No; Algorithm; Accuracy %. Rows are numbered 1 to 6; the legible rows are Random Tree: 100 and Quinlan's C4.5: 100.]

Figure 6. Comparative Classifier Performance on the Thyroid Dataset

The sizes of the decision trees generated by the Random Tree and C4.5 algorithms to classify the medical records and train the classifier are compared in Table 5. The size is measured by the number of nodes. The graphical representation of the decision tree size on all five clinical datasets is given in Figure 7.

Table 5. Decision Tree Size Generated by Random Tree and C4.5 Classification Algorithms
[Table 5 columns: Clinical Datasets; C4.5 Algorithm (No. of nodes, leaves); Random Tree Algorithm (No. of nodes, leaves).]

Figure 7. Graphical Representation of Decision Tree Size Generated by C4.5 and Random Tree Algorithms

5. CONCLUSION
Data mining applications in the field of medical research have been a challenging task for decades. In this paper, we have made a careful selection of medical datasets containing numerous attributes and multiple examples that suffice to build a classifier system that incorporates the process of learning rules and patterns from the training clinical datasets. We have surveyed the classification techniques that could be applied to the medical datasets and report their respective error rates. Our findings suggest that the Random Tree and Quinlan's C4.5 algorithms are suitable for training the classifier system in order to precisely categorize new test data. However, when it comes to efficiency, Quinlan's C4.5 algorithm outperforms the Random Tree algorithm by providing the same level of precision in grouping records but with fewer nodes, thus saving storage space and improving comprehensibility. This investigation of classifier performance is expected to make strides in the field of clinical diagnosis, prognosis and prediction, enabling data mining techniques to provide quality decision making in the health care scenario.

6. ACKNOWLEDGMENTS
This research work is a part of the All India Council for Technical Education (AICTE), India funded Research Promotion Scheme project titled "Efficient Classifier for clinical life data (Parkinson, Breast Cancer and P53 mutants) through feature relevance analysis and classification" with Reference No: 8023/RID/RPS-56/2010-11, No: 200-62/FIN/04/05/1624.
We would like to acknowledge the UCI Irvine Machine Learning Repository for providing the medical datasets to carry out this research.

7. REFERENCES
[1] Depeursinge A, Iavindrasana J, Hidki A, Cohen G, Geissbuhler A, Platon A, Poletti PA, Müller H. Comparative performance analysis of state-of-the-art classification algorithms applied to lung tissue categorization. Journal of Digit Imaging, 2010 Feb; 23(1):18-30. Epub 2008 Nov 4.
[2] A. Kusiak, K.H. Kernstine, J.A. Kern, K.A. McLaughlin, and T.L. Tseng, Data Mining: Medical and Engineering Case Studies, Proceedings of the Industrial Engineering Research 2000 Conference, Cleveland, Ohio, May 21-23, 2000, pp. 1-7.
[3] Agresti A (2007). "Building and applying logistic regression models". An Introduction to Categorical Data Analysis. Hoboken, New Jersey: Wiley. p. 138. ISBN 978-0-471-22618-5.
[4] Cios K. J. & Kurgan L. Hybrid Inductive Machine Learning: An Overview of CLIP Algorithms, In: Jain L.C. and Kacprzyk J. (Eds.), New Learning Paradigms in Soft Computing, Physica-Verlag (Springer), 2001.
[5] Eugenia G. Giannopoulou, Data Mining in Medical and Biological Research, InTech, November 2008. ISBN 978-953-7619-30-5.
[6] Greene, William H. (2003). Econometric Analysis, fifth edition. Prentice Hall. ISBN 0-13-066189-9.
[7] Hilbe, Joseph M. (2009). Logistic Regression Models. Chapman & Hall/CRC Press. ISBN 978-1-4200-7575-5.
[8] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann. ISBN 0-12-088407-0.
[9] Ian H. Witten; Eibe Frank; Mark A. Hall (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3 Ed.). Elsevier. ISBN 978-0-12-374856-0.
[10] Iavindrasana J et al., Clinical data mining: a review. Med Inform. 2009:121-33. Review.
[11] Irene M. Mullins et al., Data mining and clinical data repositories: Insights from a 667,000 patient data set, Elsevier, Computers in Biology and Medicine, August 2005.
[12] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
[13] Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni, Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction, International Journal of Computer Applications (0975 - 8887), Volume 17, No. 8, March 2011.
[14] Leo Breiman, Adele Cutler, Random Trees, http://www.stat.berkeley.edu/users/breiman/RandomForests
[15] M. Elter, R. Schulz-Wendtland and T. Wittenberg (2007). The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Medical Physics 34(11), pp. 4164-4172.
[16] Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7.
[17] Nassif et al., Information Extraction for Clinical Data Mining: A Mammography Case Study, in Proceedings of the 2009 IEEE International Conference on Data Mining Workshops.
[18] Prather et al., Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse, 1091-8280/97/$5.00, 1997 AMIA, Inc.
[19] Quinlan, J.R., Compton, P.J., Horn, K.A., & Lazarus, L. (1986). Inductive knowledge acquisition: A case study. In Proceedings of the Second Australian Conference on Applications of Expert Systems. Sydney, Australia.
[20] Ron Kohavi and Ross Quinlan, Decision Tree Discovery, October 10, 1999.
[21] S.B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica (31), 249-268, 2007.
[22] Tanagra Data Mining tutorials, http://data-mining-tutorials.blogspot.com/
[23] Tarannum A. Bloch, Prof. V.B. Vaghela, Dr. K.H. Wandra, Applied Taxonomy Techniques Intended for Strenuous Random Forest Robustness, Int. J. Comp. Tech. Appl., Vol 2 (6), Nov-Dec 2011, 2061-2065, ISSN: 2229-6093.
[24] Ryszard S. Michalski, Jaime G. Carbonell, Tom M. Mitchell (1983), Machine Learning: An Artificial Intelligence Approach, Tioga Publishing Company, ISBN 0-935382-05-4.
[25] UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[26] W. Ressom, Rency S. Varghese, Zhen Zhang, Jianhua Xuan, and Robert Clarke (2008). Classification Algorithms for phenotype prediction in genomics and proteomics. Front BioScience.
[27] Wu et al., Top 10 algorithms in data mining, Knowl Inf Syst (2008) 14:1-37, DOI 10.1007/s10115-007-0114-2.
[28] Xingquan Zhu, Ian Davidson (2007). Knowledge Discovery and Data Mining: Challenges and Realities. Hershey, New York. pp. 31-48. ISBN 978-159904252-7.