Mining of Classification Patterns in Clinical Data through
Data Mining Algorithms
Shomona Gracia Jacob
Ph.D Research Scholar
Rajalakshmi Engineering College, Thandalam, Chennai, India
91-044-26261340, 91-9841242291
R. Geetha Ramani
Professor & Head, Dept. of CSE
Rajalakshmi Engineering College, Thandalam, Chennai, India
91-9442891948
Categories and Subject Descriptors
I.5 [Pattern Recognition]: Design Methodology - classifier design and evaluation.
ABSTRACT
Data mining on clinical data is a challenging area of medical research that aims to predict and discover patterns of disease occurrence and prognosis from detected symptoms and reported health conditions. Data mining is the process of recovering related, significant information from a large collection of data. This paper highlights the significance of machine learning and data mining techniques for classifying clinical datasets downloaded from the University of California, Irvine (UCI) Machine Learning Repository, viz. Mammography masses, Orthopaedic (Vertebral Column) ailments, Dermatology infection, SPECTF (Single Photon Emission Computed Tomography) Heart and Thyroid diseases. We selected clinical data from various domains in order to assess the performance of the data mining algorithms on different types of clinical datasets. Our research work incorporates the design of a data mining framework that generates an efficient classifier, trained on all the clinical datasets stated, by learning patterns and rules framed by executing classification algorithms. This enables precise and accurate decisions when classifying unseen medical test data in the related field. Our results confirm that Quinlan's C4.5 classification algorithm and the Random Tree algorithm yield 100 percent classification accuracy on the SPECTF Heart, Orthopaedic (Vertebral Column) ailments, Thyroid and Dermatology infection datasets, while Binary Logistic Regression and CS-MC4 also give 100 percent classifier accuracy on the SPECTF Heart dataset, and Multinomial Logistic Regression classifies the Dermatology dataset with 100 percent accuracy. However, on the Mammography training dataset, the classification accuracy produced by Random Tree and Quinlan's C4.5 falls short of 100 percent. We modify the parameters that control the decision tree size and the confidence level to achieve 100 percent classifier accuracy.
The decision tree generated by Quinlan's C4.5 algorithm is smaller than the decision tree given by the Random Tree classification technique. Our research states that Quinlan's C4.5 algorithm is the most efficient in building a classifier trained on clinical datasets from diverse domains that could provide 100 percent classifier accuracy on previously unseen test data.
General Terms
Algorithms, Performance, Design
Keywords
Data Mining, Machine Learning, Classification, Clinical data
1. INTRODUCTION
Data mining (Eugenia, 2008) is a process of retrieving consequential information from an exhaustive collection of data. Data mining tools (Witten, 2011a, 2011b) predict future patterns and possible trends that enable users to make knowledge-driven decisions. They search records for hidden information in order to provide extrapolative information that experts may otherwise overlook. Machine intelligence (Mitchell, 1983; S.B.Kotsiantis, 2007) is one of the core components of data mining; it involves the study and analysis of computer algorithms and techniques that are expected to improve automatically with experience. One of the key application areas in the field of machine learning (Mitchell, 1997) explores data mining programs that discover general rules in sizeable datasets. A major application and research area of machine learning is the improvement of classifier accuracy by learning. Classification (Iavindrasana, 2008) is a data mining technique derived from the concept of machine learning. In classification, the goal is to find a model for the class or target attribute as a function of the other predictor attributes. In this research work we have analyzed the performance of classification algorithms in performing both binary and multi-class classification. In binary classification (Wu, 2008), the target class can have only two possible values, whereas in multi-class classification the target attribute can have multiple values.
Clinical data mining (Iavindrasana, 2009) is committed to retrieving novel and previously unknown information from medical records and databases. Classification (Ressom, 2008) is the most commonly applied data mining function on clinical datasets. Applying technology in the field of medicine is expected to advance the current state of disease prediction and prognosis. Early and precise prediction and classification of ailments is expected to aid researchers in designing drugs that will prevent and arrest the progress of life-threatening ailments and provide hope to a large section of the ailing population. In this paper we bring together classification techniques to discover patterns in medical records and highlight the performance of the training models that will enable more precise prediction and classification of ailments.
In this paper we apply nearly twenty classification algorithms, viz. Quinlan's C4.5, Random Tree, Binary Logistic Regression (BLR), Multinomial Logistic Regression (MLR), Partial Least Squares for Classification (C-PLS), Classification Tree (C-RT), Cost-Sensitive Classification Tree (CS-CRT), Cost-sensitive Decision Tree algorithm (CS-MC4), SVM for classification (C-SVC), Iterative Dichotomiser (ID3), K-Nearest Neighbor (K-NN), Linear Discriminant Analysis (LDA), Logistic Regression (Log-Reg), Multilayer Perceptron (MP), Naïve Bayes Classifier (NBC), Partial Least Squares - Discriminant/Linear Discriminant Analysis (PLS-DA/LDA), Prototype-Nearest Neighbor (P-NN), Radial Basis Function (RBF), and Support Vector Machine (SVM) classification algorithms on the training clinical datasets, and evaluate the performance of the algorithms based on their error rates, accuracy and decision tree size.
Prather et al. (1997) made a complete study of the data mining techniques involved in knowledge discovery from medical databases. However, their interests were focused mainly on data warehousing, data cleaning and data analysis. They collected around 3902 patient records relating to obstetrics, and this data was evaluated by means of exploratory factor analysis to identify factors contributing to preterm birth. Three factors were identified for further investigation. Nassif et al. (2009) described a concept information extraction technique, given a lexicon, and a BI-RADS feature extraction algorithm for clinical data mining on the mammography dataset alone. However, the number of records is limited to 100. They report that their algorithm achieves 97.7% precision, 95.5% recall and an F-score of 0.97. It is said to outperform manual feature extraction at the 5% statistical significance level, and in particular it achieves a high recall gain over manual indexing.
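The three figures quoted from Nassif et al. are related: an F-score is a harmonic mean of precision and recall. The sketch below is a minimal illustration, assuming the reported value is the balanced F1 score (the source does not say which variant was used); the function name `f_score` and its `beta` parameter are introduced here for illustration only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    denominator = beta ** 2 * precision + recall
    if denominator == 0.0:
        return 0.0  # both inputs are zero, so the harmonic mean is taken as 0
    return (1.0 + beta ** 2) * precision * recall / denominator


# Precision and recall as quoted above for the BI-RADS extraction algorithm.
reported_f1 = f_score(precision=0.977, recall=0.955)
print(f"{reported_f1:.2f}")  # prints 0.97
```

Under the F1 assumption, the quoted precision and recall reproduce the quoted F-score of 0.97 after rounding to two decimals.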
The proposed data mining framework in our paper is portrayed in detail in Section 3.
1.1 Paper Organization
Section 2 briefly reviews the related work in the area of data mining, while Section 3 describes the proposed system. Section 4 substantiates our findings with the necessary results and Section 5 concludes the paper.
2. BACKGROUND OF THE WORK
Previous research findings in the area of clinical data mining and knowledge discovery are briefly reported in the following paragraphs.
Bloch et al. (Bloch, 2011) evaluated the performance of classifiers, viz. J48, Random Forest, Naïve Bayes, AdaBoost M1 and Bagging, on the datasets Labor, Iris, Vertebral Column, Ionosphere, Dermatology and Car. They reported 100% accuracy for the Random Forest classifier on all the datasets. Moreover, they evaluated classifier performance using ROC values and Kappa statistics in the Weka data mining tool. Soni et al. (March, 2011) performed an analysis of data mining techniques using Tanagra. However, they made a complete study of only the heart disease dataset and provided a survey of classifier performance with results from cross-validation. Their results do not report 100 percent accuracy. Kusiak et al. (2000) discussed the problem of predicting outcomes in medical and engineering datasets. They discussed the performance of only two approaches, viz. rough set theory and decision tree generation. They report 100 percent prediction accuracy for the medical datasets but only 98.6% accuracy on the engineering data.
Mullins et al. (Mullins et al., 2005) applied a new data mining technique named "HealthMiner" to a large cohort of 667,000 inpatient and outpatient records from an academic digital system. HealthMiner approaches knowledge discovery using three unsupervised rule discovery methods: CliniMiner, Predictive Analysis, and Pattern Discovery. They tabulated the results for data trend characterization, discovery of medically known and unknown correlations, and identification of data anomalies using all three unsupervised methods. Their results suggest that unsupervised data mining of large clinical repositories is feasible.
3. SYSTEM DESIGN
The data mining (Han and Kamber, 2000) framework designed to generate an efficient classifier is described in the following sub-sections. A diagrammatic representation of the data mining framework proposed in our paper is displayed in Figure 1.
Figure 1. Data Mining Framework for Design of Efficient Classifier
The proposed system comprises the Training Data Selection phase and Data Pre-processing, followed by the Classification phase and the Evaluation phase that selects the best classifier. After choosing the best classifier, the test data is loaded to verify the accuracy of the selected classifier.
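The evaluation phase of the framework selects one classifier out of many. The paper states that the choice is based on the misclassification rate and the decision tree size; the sketch below shows one way such a rule could be written. The `ClassifierResult` type, the `select_best` function, the tie-breaking order and all numbers are assumptions made for illustration, not the authors' implementation or results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClassifierResult:
    name: str
    error_rate: float                 # misclassification rate in [0, 1]
    tree_nodes: Optional[int] = None  # None for models that build no tree


def select_best(results: Sequence[ClassifierResult]) -> ClassifierResult:
    """Pick the lowest error rate; break ties by the smallest tree."""
    if not results:
        raise ValueError("no classifier results to compare")

    def ranking_key(result: ClassifierResult):
        # Models without a tree rank after any tree with the same error rate.
        size = result.tree_nodes if result.tree_nodes is not None else float("inf")
        return (result.error_rate, size)

    return min(results, key=ranking_key)


# Hypothetical numbers for illustration only, not results from the paper.
candidates = [
    ClassifierResult("tree_a", error_rate=0.0, tree_nodes=41),
    ClassifierResult("tree_b", error_rate=0.0, tree_nodes=9),
    ClassifierResult("linear_model", error_rate=0.08),
]
best = select_best(candidates)
print(best.name)  # prints tree_b
```

Ranking on the tuple (error rate, tree size) makes accuracy the primary criterion and uses tree size only to separate equally accurate trees.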
3.1 Training Datasets
The medical datasets used to train the classifier cover a broad range of medical ailments, and each dataset holds different types of data, including numbers, text and enumerated domains of values. Hence a classification algorithm that performs well on each of these datasets should aid in predicting disease occurrence patterns and provide scope for discovering unknown trends in the stored medical records. The datasets, their attributes, number of training examples and possible values are presented in Table 1. These clinical datasets were downloaded from the University of California, Irvine Machine Learning Repository (UCI, Irvine) to carry out this research work.
Table 1. Description of Clinical Datasets
Domain                   Predictor attributes   Instances
SPECT-Heart disease      44                     286
Dermatology infection    33                     366
The Dermatology database is concerned with the differential diagnosis of erythemato-squamous diseases, which share the clinical features of erythema and scaling with very few differences. The diseases in this group are Psoriasis, Seborrheic Dermatitis, Lichen Planus, Pityriasis Rosea, Chronic Dermatitis and Pityriasis Rubra Pilaris. The values of the histopathological features are determined by an analysis of the samples under a microscope. The Thyroid data comprises continuous and discrete valued attributes based on the Thyroxine levels and the hormones T3, T4 and TSH. The classification may be a negative case or a discordant case. The SPECTF-Heart dataset describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient is classified into one of two categories: normal and abnormal. The database of 267 SPECT image sets (patients) was processed to extract 44 features that summarize the original SPECT images, available at the public data repository, the UCI Irvine Machine Learning Repository. It is named SPECTF in the repository to differentiate it from the SPECT data set, which is further processed to reduce the attributes to 22. The Mammography masses data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient's age. It contains a BI-RADS assessment, the patient's age and three BI-RADS attributes together with the severity field, on full field digital mammograms collected at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006. These can be an indication of how well a CAD system performs compared to the radiologists. In the Orthopaedic (Vertebral Column) dataset each patient is represented by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine. Patients are classified into normal and abnormal categories.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICACCI '12, August 03 - 05 2012, CHENNAI, India
Copyright 2012 ACM 978-1-4503-1196-…
International Conference on Advances in Computing, Communications and Informatics (ICACCI-2012)
3.2 Data Pre-processing
The datasets are explored individually, since each dataset targets a unique ailment, has a varying number of records, and holds values that are discrete or continuous according to the nature of the attribute predicting the specific disease. Moreover, the target values of the class label vary according to the particular malady. Hence the datasets, available in text and .arff formats, are imported into an Excel spreadsheet with column headers that clearly indicate the predictor attributes and the specific target attribute. Missing values are replaced with related values. The Excel spreadsheet is then loaded into Tanagra, a data mining tool (Tanagra). The attributes are specified and the loaded data is visualized for verification. Once the data is found to be precisely recorded, we proceed with classification as explained in the following section.
3.3 Classification
The main objective of classification is to accurately predict the target class for each record. The training process of classification attempts to discover relationships between the predictor and the target values. Classification algorithms (Elter, 2007) differ in the techniques they use to determine these relationships. These relationships are summarized in a model, which is then applied to a record (test data) where the class label is unknown. The best performing classification algorithms in our survey on the datasets, using the Tanagra data mining tool, are briefly explained in the following sub-sections.
3.3.1 Quinlan's C4.5 Algorithm
C4.5 is an algorithm used to generate a decision tree and was developed by Ross Quinlan (Kohavi and Quinlan, 1999). The training data is a set S = S1, S2, ... of already classified samples. Each sample Si = k1, k2, ... is a vector where k1, k2, ... represent attributes or features of the sample. The training data is augmented with a vector C = C1, C2, ... where C1, C2, ... represent the class to which each sample belongs. At each node of the tree, C4.5 chooses the one attribute of the data that most effectively splits the set of samples into subsets enriched in one class or the other (Quinlan, 1986). Its criterion is the normalized information gain that results from choosing an attribute for splitting the data. Sample rules generated by the C4.5 algorithm for the SPECTF heart dataset are given in Figure 2.
F21R =60.5000
F20S
F21R
F21R >= 77.5000 then DIAGNOSIS= ABNORMAL
Figure 2. Sample Classification Rules generated by the C4.5 Algorithm on the SPECTF-Heart Dataset
The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
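The split criterion described above, normalized information gain (often called gain ratio), can be sketched for discrete attributes as follows. This is a minimal illustration, not the Tanagra or C4.5 reference implementation: the function names, the toy records and the attribute names `f1` and `f2` are hypothetical, and the threshold search that C4.5 performs for continuous attributes is left out.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(items: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the value distribution in `items`."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    subsets: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    information_gain = entropy(labels) - remainder
    split_information = entropy(values)  # entropy of the attribute itself
    if split_information == 0.0:
        return 0.0  # the attribute has a single value and cannot split the data
    return information_gain / split_information


def best_attribute(rows: Sequence[Dict[str, Hashable]],
                   labels: Sequence[Hashable],
                   attributes: Sequence[str]) -> str:
    """Return the attribute with the highest gain ratio."""
    return max(attributes, key=lambda a: gain_ratio([row[a] for row in rows], labels))


# Hypothetical toy records; attribute names and values are illustrative only.
rows = [
    {"f1": "high", "f2": "x"},
    {"f1": "high", "f2": "y"},
    {"f1": "low", "f2": "x"},
    {"f1": "low", "f2": "y"},
]
labels = ["abnormal", "abnormal", "normal", "normal"]
print(best_attribute(rows, labels, ["f1", "f2"]))  # prints f1
```

In this toy data `f1` separates the two classes perfectly, while `f2` leaves each subset evenly mixed, so `f1` is selected. A full tree builder would repeat the same selection on each resulting subset.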
3.3.2 Random Tree Algorithm
Random trees were introduced by Leo Breiman and Adele Cutler. Random trees are a collection (ensemble) of tree predictors that is called a forest. Classification works as follows: the random trees classifier takes the input feature vector, classifies it with every tree in the forest, and outputs the class label that received the majority of "votes" (Ressom, 2008). In the case of regression, the classifier response is the average of the responses over all the trees in the forest. In most machine learning algorithms, the best approximation to the target function is assumed to be the "simplest" classifier that fits the given data, since more complex models tend to overfit the training data and generalize poorly (Tanagra tutorials). Sample rules generated by the Random Tree algorithm for the SPECTF heart dataset are given in Figure 3.
F22S
F20S= 68.5000
F19R
3.3.3 Multinomial Logistic Regression
Multinomial logistic regression (William, 2003; Hilbe, 2009) is a regression model that generalizes logistic regression by allowing more than two discrete outcomes. The model is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (Tanagra tutorials). The variables may be real-valued, binary-valued, categorical-valued, etc. The multinomial logistic model assumes that the data are case specific; that is, each independent variable has a single value for each case. It also assumes that the dependent variable cannot be perfectly predicted from the independent variables for any case. Here collinearity (Agresti, 2007; Cios, 2001) is assumed to be relatively low, as it becomes difficult to differentiate between the impacts of several variables if they are highly correlated.
3.3.4 Cost Sensitive-Misclassification Cost Matrix (CS-MC4)
The cost-sensitive classification algorithm (Kotsiantis, 2007; Wu, 2008) is similar to C4.5 but uses m-estimate smoothed probability estimation, which is a generalization of the Laplace estimate. It minimizes the expected loss using the misclassification cost matrix to detect the best prediction within leaves (Tanagra tutorials). The target attribute must be discrete in nature, while the predictors may be continuous or discrete valued.
3.4 Evaluation Phase
The best performing classifier is chosen based on the misclassification rates and the decision tree size generated by the respective algorithms. The details of the classifier results are outlined in Section 4. The Random Tree algorithm, C4.5 algorithm, CS-MC4, Binary Logistic Regression and Multinomial Logistic Regression algorithms proved to be the most accurate, as substantiated by the results obtained in our classifier performance analysis.
3.5 Test Phase
The test phase is necessary to corroborate the classifier accuracy. Consequently we present the classifier with clinical test data, one from each kind of ailment on which the classifier was trained, and validate the precision with which the classification is made. The test data was correctly classified for each category of ailment covered in this paper.
The performance of the classifier is dealt with in the following section.
4. PERFORMANCE EVALUATION
The classification techniques that have been analyzed on the medical datasets are graded based on performance measures that include error rate, accuracy and decision tree size. They are presented in the following sub-section.
4.1 Measures of Performance
The best performing classifier is chosen based on the following measures:
4.1.1 Accuracy
The accuracy (Han and Kamber, 2000) of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.
4.1.2 Error-Rate
Also called the misclassification rate, it is measured as 1-Acc(M), where Acc(M) is the accuracy of classifier M.
4.1.3 Decision Tree Size
The size of the decision tree (Zhu, 2007; Kohavi and Quinlan, 1999) is specified by the number of nodes. The tree that is able to predict the correct class label with the smallest number of nodes is taken to be the efficient classifier.
4.2 Experimental Results
The experimental results for the twenty classification algorithms on the medical datasets are outlined in this section.
The Random Tree classification algorithm and Quinlan's C4.5 algorithm produce 100 percent accuracy on the SPECTF Heart, Orthopaedic ailment, Dermatology and Thyroid disease datasets. Binary Logistic Regression and CS-MC4 produce 100 percent accuracy on the SPECTF Heart dataset, and Multinomial Logistic Regression produces 100 percent accuracy on the Dermatology dataset. The comparative classifier accuracy on the SPECTF-Heart and Mammography Masses datasets is tabulated in Table 2.
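The accuracy and error-rate measures used to grade the classifiers can be sketched as follows. This is a minimal illustration under the definitions given in the paper (accuracy as a percentage of correctly classified tuples, error rate as one minus accuracy expressed as a fraction); the function names and the four toy labels are hypothetical.

```python
from typing import Hashable, Sequence


def accuracy(true_labels: Sequence[Hashable], predicted_labels: Sequence[Hashable]) -> float:
    """Percentage of test tuples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one test tuple is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


def error_rate(true_labels: Sequence[Hashable], predicted_labels: Sequence[Hashable]) -> float:
    """Misclassification rate, 1 - Acc(M), with accuracy taken as a fraction."""
    return 1.0 - accuracy(true_labels, predicted_labels) / 100.0


# Hypothetical test labels for illustration only.
actual = ["normal", "abnormal", "abnormal", "normal"]
predicted = ["normal", "abnormal", "normal", "normal"]
print(accuracy(actual, predicted))    # prints 75.0
print(error_rate(actual, predicted))  # prints 0.25
```

A classifier reported at 100 percent accuracy corresponds to an error rate of 0 under these definitions.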
Table 2. Comparative Classifier Performance on the SPECT-Heart and Mammography Masses Dataset
Classification Algorithm   SPECTF Accuracy (%)   Mammography Accuracy (%)
Random Tree                100                   91.36
Figure 4. Performance Comparison of Classification Algorithms on the SPECTF-Heart and Mammography Datasets
The performance of the classification algorithms on the SPECTF-Heart and Mammography datasets is shown graphically in Figure 4. The comparison for each dataset covers only the algorithms that can be applied to it, since the type of the attributes and the target values decide which classification algorithms can be executed. The thyroid disease dataset is composed of predictor attributes that are both continuous and discrete-valued; hence only six classification algorithms could be executed on it. The parameter that sets the number of attributes selected for a split in the Random Tree algorithm is set according to the number of attributes, to ensure precise classification on all five clinical datasets.
The precise values of accuracy produced by the classification algorithms on the Orthopaedic and Dermatology datasets are given in Table 3.
Table 3. Performance of Classification Algorithms on Orthopaedic Ailment and Dermatology Infection Datasets
Algorithm        Orthopaedic Accuracy (%)   Dermatology Accuracy (%)
Random Tree      100                        100
Quinlan's C4.5   100                        100
The accuracy of sixteen classification algorithms on the Orthopaedic ailment and Dermatology infection datasets is graphically displayed in Figure 5.
Figure 5. Graphical Representation of Classifier Performance on Orthopaedic and Dermatology Infection Dataset
The comparative classifier accuracy on the Thyroid dataset is displayed in tabular form in Table 4.
Table 4. Performance of Classification Algorithms on Thyroid Disease Dataset
Algorithm        Accuracy (%)
Random Tree      100
Quinlan's C4.5   100
The graphical representation of classification algorithms on the Thyroid dataset is portrayed in Figure 6.
Figure 6. Comparative Classifier Performance on the Thyroid Dataset
The size of a decision tree is measured by its number of nodes. The decision tree size parameters on all five clinical datasets are represented graphically in Figure 7.
Figure 7. Graphical Representation of Decision Tree Size Generated by C4.5 and Random tree Algorithms
The sizes of the decision trees generated by the Random Tree and C4.5 algorithms to classify the medical records and train the classifier are compared in Table 5.
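Decision tree size is reported in the paper as a number of nodes and a number of leaves. The sketch below shows how both counts could be obtained from a tree held in memory. The `Node` class, the `tree_size` function and the toy tree with its attribute labels are hypothetical and do not reflect how Tanagra stores its trees.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    label: str                                  # split attribute or class label
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> Tuple[int, int]:
    """Return (number of nodes, number of leaves) of the tree rooted at `node`."""
    if not node.children:
        return 1, 1  # a leaf counts as one node and one leaf
    nodes, leaves = 1, 0
    for child in node.children:
        child_nodes, child_leaves = tree_size(child)
        nodes += child_nodes
        leaves += child_leaves
    return nodes, leaves


# Hypothetical toy tree: a root split, one inner split and three leaves.
toy_tree = Node("attribute_a", [
    Node("ABNORMAL"),
    Node("attribute_b", [Node("NORMAL"), Node("ABNORMAL")]),
])
print(tree_size(toy_tree))  # prints (5, 3)
```

With equal accuracy, the tree with the smaller node count would be preferred under the efficiency criterion described in Section 4.1.3.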
Table 5. Decision Tree Size Generated by Random Tree and C4.5 Classification Algorithms
Columns: Clinical Datasets; C4.5 Algorithm (No. of nodes, leaves); Random Tree Algorithm (No. of nodes, leaves)
5. CONCLUSION
Data mining applications in the field of medical research have been a challenging task for decades. In this paper, we made a careful selection of medical datasets containing numerous attributes and multiple examples, sufficient to build a classifier system that learns rules and patterns from the training clinical datasets. We surveyed the classification techniques that could be applied to the medical datasets and reported their respective error rates. Our findings suggest that the Random Tree and Quinlan's C4.5 algorithms are suitable for training the classifier system in order to precisely categorize new test data. However, when it comes to efficiency, Quinlan's C4.5 algorithm outperforms the Random Tree algorithm by providing the same level of precision in grouping records with fewer nodes, thus saving storage space and improving comprehensibility. This investigation of classifier performance is expected to advance clinical diagnosis, prognosis and prediction, enabling data mining techniques to support quality decision making in the health care scenario.
6. ACKNOWLEDGMENTS
This research work is a part of the All India Council for Technical Education (AICTE), India funded Research Promotion Scheme project titled "Efficient Classifier for clinical life data (Parkinson, Breast Cancer and P53 mutants) through feature relevance analysis and classification" with Reference No: 8023/RID/RPS-56/2010-11, No: 200-62/FIN/04/05/1624. We would like to acknowledge the UCI Irvine Machine Learning Repository for providing the medical datasets to carry out this research.
7. REFERENCES
[1] A. Depeursinge, J. Iavindrasana, A. Hidki, G. Cohen, A. Geissbuhler, A. Platon, P.A. Poletti, H. Müller. Comparative performance analysis of state-of-the-art classification algorithms applied to lung tissue categorization. Journal of Digital Imaging, 2010 Feb; 23(1):18-30. Epub 2008 Nov 4.
[2] A. Kusiak, K.H. Kernstine, J.A. Kern, K.A. McLaughlin, and T.L. Tseng, Data Mining: Medical and Engineering Case Studies, Proceedings of the Industrial Engineering Research 2000 Conference, Cleveland, Ohio, May 21-23, 2000, pp. 1-7.
[3] Agresti A (2007). "Building and applying logistic regression models". An Introduction to Categorical Data Analysis. Hoboken, New Jersey: Wiley. p. 138. ISBN 978-0-471-22618-5.
[4] Cios K. J. & Kurgan L. Hybrid Inductive Machine Learning: An Overview of CLIP Algorithms, In: Jain L.C., and Kacprzyk J. (Eds). New Learning Paradigms in Soft Computing, Physica-Verlag (Springer), 2001.
[5] Eugenia G. Giannopoulou, Data Mining in Medical and Biological Research, InTech, November, 2008. ISBN 978-953-7619-30-5.
[6] Greene, William H. (2003). Econometric Analysis, fifth edition. Prentice Hall. ISBN 0-13-066189-9.
[7] Hilbe, Joseph M. (2009). Logistic Regression Models. Chapman & Hall/CRC Press. ISBN 978-1-4200-7575-5.
[8] Ian H. Witten and Eibe Frank, Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann. ISBN 0-12-088407-0.
[9] Ian H. Witten; Eibe Frank; Mark A. Hall (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3 Ed.). Elsevier. ISBN 978-0-12-374856-0.
[10] Iavindrasana J et al., Clinical data mining: a review. Med Inform. 2009:121-33. Review.
[11] Irene M. Mullins, et al., Data mining and clinical data repositories: Insights from a 667,000 patient data set, Elsevier - Computers in Biology and Medicine, August 2005.
[12] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
[13] Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni, Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction, International Journal of Computer Applications (0975 - 8887), Volume 17, No. 8, March 2011.
[14] Leo Breiman, Adele Cutler, Random Trees, http://www.stat.berkeley.edu/users/breiman/RandomForests
[15] M. Elter, R. Schulz-Wendtland and T. Wittenberg (2007). The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Medical Physics 34(11), pp. 4164-4172.
[16] Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7.
[17] Nassif et al., Information Extraction for Clinical Data Mining: A Mammography Case Study, in Proceedings of the 2009 IEEE International Conference on Data Mining Workshops.
[18] Prather et al., Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse, 1091-8280/97/$5.00, 1997 AMIA, Inc.
[19] Quinlan, J.R., Compton, P.J., Horn, K.A., & Lazarus, L. (1986). Inductive knowledge acquisition: A case study. In Proceedings of the Second Australian Conference on Applications of Expert Systems. Sydney, Australia.
[20] Ron Kohavi and Ross Quinlan, Decision Tree Discovery, October 10, 1999.
[21] S.B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica (31), 249-268, 2007.
[22] Tanagra Data Mining tutorials, http://data-mining-tutorials.blogspot.com/
[23] Tarannum A. Bloch, Prof. V.B. Vaghela, Dr. K.H. Wandra, Applied Taxonomy Techniques Intended for Strenuous Random Forest Robustness, Int. J. Comp. Tech. Appl., Vol 2 (6), Nov-Dec, 2011, 2061-2065, ISSN: 2229-6093.
[24] Ryszard S. Michalski, Jaime G. Carbonell, Tom M. Mitchell (1983), Machine Learning: An Artificial Intelligence Approach, Tioga Publishing Company, ISBN 0-935382-05-4.
[25] UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[26] W. Ressom, Rency S. Varghese, Zhen Zhang, Jianhua Xuan, and Robert Clarke (2008). Classification Algorithms for phenotype prediction in genomics and proteomics. Front Biosci.
[27] Wu et al., Top 10 algorithms in data mining, Knowl Inf Syst (2008) 14:1-37, DOI 10.1007/s10115-007-0114-2.
[28] Xingquan Zhu, Ian Davidson (2007). Knowledge Discovery and Data Mining: Challenges and Realities. Hershey, New York. pp. 31-48. ISBN 978-159904252-7.