Proceedings of the International Conference on
Electrical Engineering and Informatics
Institut Teknologi Bandung, Indonesia June 17-19, 2007
B-28
Data Mining Discretization Methods and Performances
Z. Marzuki, F. Ahmad
Faculty of Information Technology, Universiti Utara Malaysia, 06010 UUM Sintok, Kedah, Malaysia
Phone: (60)-04-9284722; Fax (60)-04-9284753
Email: {zaharin, fudz}@uum.edu.my
Abstract
The discretization process is known to be one of the most important data preprocessing tasks in data mining. Many discretization methods are presently available, including Boolean Reasoning, Equal Frequency Binning, Entropy, and others. Each method was developed for specific problems or domain areas; consequently, applying such methods in other areas might not be appropriate. Inappropriate use of a technique can cause serious problems, such as the loss of important data, leading to inaccurate results and unreliable models. This study evaluates the performance of various discretization methods on several domain areas. The experiments were validated using the 10-fold cross validation method, and a ranking of the methods' performances was derived from them. The results suggest that different discretization methods perform better in one or more domain areas.
1. Introduction
In data mining, discretization is known to be one of the most important data preprocessing tasks. Most existing machine learning algorithms are capable of extracting knowledge from databases that store discrete attributes. If the attributes are continuous, these algorithms can be coupled with a discretization algorithm that transforms the continuous attributes into discrete features.
Discretization methods are used to reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals (5)(2). Discretization makes learning more accurate and faster. The results of the process (decision trees, induction rules) are usually more compact, shorter and more accurate when discrete features are used rather than continuous ones (7)(8). Furthermore, according to (3), "the most important performance criterion of the discretization method is the accuracy rate".
This paper begins with a brief description of the importance of discretization in data mining. In Section 2, the steps of the discretization process are listed and the chosen methods are described. Section 3 explains the experimental evaluation conducted in this study and presents the results obtained. Discussion of the findings and concluding remarks follow in Section 4.
2. Discretization Process
A typical discretization process consists of four steps: (i) sort all the continuous values of the feature to be discretized; (ii) choose a cut point to split the continuous values into intervals; (iii) split or merge the intervals of continuous values; and (iv) evaluate the stopping criterion of the discretization process (8).
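The methods below differ mainly in how they pick cut points in step (ii) and when they stop in step (iv); the interval assignment of step (iii) is common to all of them. As a minimal illustration (our own sketch, not code from the paper), mapping values to interval indices given a set of cuts can be written as:

    import bisect

    def apply_cuts(values, cuts):
        """Step (iii): map each continuous value to the index of the
        interval it falls into, given the chosen cut points."""
        cuts = sorted(cuts)
        return [bisect.bisect_left(cuts, v) for v in values]

    print(apply_cuts([1.2, 3.4, 5.6, 7.8], [3.0, 6.0]))  # -> [0, 1, 1, 2]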
To carry out this process, a discretization method has to be applied. The following subsections describe the discretization methods used in this study, namely Boolean Reasoning, Equal Frequency Binning, Entropy Minimum Description Length (Entropy/MDL), Naïve and Semi Naïve.
2.1 Boolean Reasoning
As outlined by (10), a straightforward implementation of the algorithm combines the cuts found by a Boolean Reasoning procedure, discarding all but a small subset of the candidate cuts while preserving discernibility. The remaining subset is a minimal set of cuts that preserves the discernibility inherent in the dataset. The algorithm operates by first creating a Boolean function f from the set of candidate cuts, and then computing a prime implicant of this function (9).
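Computing a prime implicant exactly is expensive in general, so implementations commonly approximate it greedily, repeatedly keeping the cut that discerns the most remaining object pairs with different decisions. The following single-attribute sketch of that greedy approximation is our own simplification, not the paper's code:

    from itertools import combinations

    def boolean_reasoning_cuts(values, labels):
        """Greedy approximation of a minimal discernibility-preserving
        cut set for one attribute (a stand-in for computing a prime
        implicant of the Boolean function built from candidate cuts)."""
        pts = sorted(set(values))
        # Candidate cuts: midpoints between consecutive distinct values.
        candidates = [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
        # Object pairs with different decisions must remain discernible.
        pairs = {(i, j) for i, j in combinations(range(len(values)), 2)
                 if labels[i] != labels[j] and values[i] != values[j]}
        chosen = []
        while pairs:
            # Keep the cut that separates the most remaining pairs.
            best = max(candidates,
                       key=lambda c: sum((values[i] < c) != (values[j] < c)
                                         for i, j in pairs))
            separated = {(i, j) for i, j in pairs
                         if (values[i] < best) != (values[j] < best)}
            if not separated:
                break  # remaining pairs cannot be discerned on this attribute
            chosen.append(best)
            pairs -= separated
        return sorted(chosen)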
2.2 Equal Frequency Binning
Equal Frequency Binning is a simple unsupervised, univariate discretization algorithm which discretizes a continuous-valued attribute by creating a specified number of bins. The number of intervals determines the number of bins. Each bin is associated with a distinct discrete value, and an equal number of continuous values is assigned to each bin (8).
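As a concrete illustration (our own sketch using NumPy; handling of ties at bin boundaries varies between implementations):

    import numpy as np

    def equal_frequency_bins(values, n_bins):
        """Assign each value a bin index 0..n_bins-1 so that each bin
        receives (approximately) the same number of values."""
        values = np.asarray(values, dtype=float)
        # Interior cut points are the 1/n, 2/n, ..., (n-1)/n quantiles.
        cuts = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
        return np.digitize(values, cuts)

    print(equal_frequency_bins([1, 2, 3, 4, 5, 6, 7, 8, 9], 3))
    # -> [0 0 0 1 1 1 2 2 2]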
2.3 Entropy/MDL
According to (7), Entropy/MDL is an algorithm that recursively partitions the value set of each attribute so that a local measure of entropy is optimized. In this algorithm, the minimum description length principle defines the stopping criterion for the partitioning process.
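A condensed sketch of the recursion, following the widely used Fayyad-Irani formulation of the MDL stopping test (our paraphrase of that standard formulation; the implementation used in the paper may differ in detail):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def mdl_cuts(pairs):
        """Recursively split sorted (value, label) pairs at the
        entropy-minimizing cut; stop when the MDL test fails."""
        n = len(pairs)
        best = None
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # no cut between equal values
            left = [c for _, c in pairs[:i]]
            right = [c for _, c in pairs[i:]]
            e = (i * entropy(left) + (n - i) * entropy(right)) / n
            if best is None or e < best[0]:
                best = (e, i, left, right)
        if best is None:
            return []
        e, i, left, right = best
        labels = left + right
        gain = entropy(labels) - e
        k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
        delta = (math.log2(3 ** k - 2)
                 - (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)))
        if gain <= (math.log2(n - 1) + delta) / n:  # MDL stopping criterion
            return []
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        return mdl_cuts(pairs[:i]) + [cut] + mdl_cuts(pairs[i:])

    print(mdl_cuts(sorted(zip([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))))
    # -> [2.5]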
2.4 Naïve
The Naïve algorithm takes both condition attributes and decision attributes into consideration (1). The algorithm sorts the values of each condition attribute first, then considers a cut point between every two values of the attribute.
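Under one common reading of this rule (our assumption; the text states it only informally), a cut is placed between every pair of adjacent distinct values whose objects carry different decisions:

    def naive_cuts(values, labels):
        """Naive rule: after sorting, cut between adjacent distinct
        values whose objects belong to different decision classes."""
        pairs = sorted(zip(values, labels))
        return [(v1 + v2) / 2.0
                for (v1, c1), (v2, c2) in zip(pairs, pairs[1:])
                if v1 != v2 and c1 != c2]

    print(naive_cuts([1, 2, 3, 4], ["a", "a", "b", "b"]))  # -> [2.5]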
2.5 Semi Naïve
The Semi Naïve algorithm is functionally similar to Naïve but has additional logic to handle the case where value-neighboring objects belong to different decision classes.
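One plausible reading of that extra logic (our assumption; the exact rule in (1) may differ) is to group equal values first and to skip a cut only when both neighboring value groups are pure and carry the same single decision class:

    from itertools import groupby

    def semi_naive_cuts(values, labels):
        """Like the naive rule, but equal values are grouped first; a
        cut is skipped when two neighboring groups are pure and agree
        on the same decision class."""
        pairs = sorted(zip(values, labels))
        groups = [(v, {c for _, c in grp})
                  for v, grp in groupby(pairs, key=lambda p: p[0])]
        cuts = []
        for (v1, cls1), (v2, cls2) in zip(groups, groups[1:]):
            if not (len(cls1) == 1 and cls1 == cls2):
                cuts.append((v1 + v2) / 2.0)
        return cuts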
3. Experimental Evaluation
3.1 The Datasets
The datasets were obtained from the UCI Repository (4). Datasets from two different areas, medical and engineering, were selected, and they vary in size. From each area, four datasets were selected for the process. During each process, the five discretization methods were applied to the datasets. A summary of the datasets is given in Table 1 and Table 2.
Table 1. The medical datasets for the experiment

Dataset                          Classes   Instances   Continuous attrs.   Total attrs.
Wisconsin breast cancer (wbc)       2         699              9                 9
Hepatitis (hep)                     2         155              6                20
Lung Cancer (lung)                  3          32             57                 -
Lympho (lym)                        4         148             19                19
Table 2. The engineering datasets for the experiment

Dataset                          Classes   Instances   Continuous attrs.   Total attrs.
Ionosphere (ion)                    2         351             34                34
Image Segmentation (image)          7         210             19                19
Relative CPU Performance (cpu)      8         209              8                10
Auto-Mpg (auto)                   many        398              5                 9
3.2 Experimental Design
During the experimental design, the missing values in the datasets must be handled; in this study, this step was accomplished using SPSS. Next, each dataset was discretized using each of the discretization methods, namely Boolean Reasoning, Equal Frequency Binning, Entropy, Naïve and Semi Naïve.
For training and testing purposes, a split factor of 0.2 was used to partition the datasets. A Genetic Algorithm was used to generate reducts and rules for all datasets. For each dataset, 10-fold cross validation was performed to train and test a Standard Voting classifier. The results of these steps were measured as the average percentage of correct classification across all iterations. The overall results of the experiment are presented in the next section.
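The genetic reduct generation and Standard Voting classification are not reproduced here; as a rough scikit-learn analogue of the evaluation loop (our substitution: a quantile discretizer and a decision tree stand in for the reduct-and-rule classifier; only the discretize-then-classify structure and the 10-fold setting come from the text):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def evaluate(X, y):
        """Discretize, classify, and score with 10-fold cross validation."""
        model = make_pipeline(
            KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
            DecisionTreeClassifier(random_state=0),
        )
        return 100 * cross_val_score(model, X, y, cv=10).mean()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    print(f"average correct classification: {evaluate(X, y):.1f}%")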
3.3 Experimental Results
Table 3 and Table 4 show the experimental results of the Boolean Reasoning, Equal Frequency Binning, Entropy, Naïve and Semi Naïve discretization methods on the medical and engineering datasets respectively. The last row of each table gives the average accuracy of each discretization method over the four datasets.
Table 3. Classification accuracy (%) on the medical datasets

Dataset   Boolean Reasoning   Equal Frequency   Entropy   Naïve
wbc              81                 82             85       90
hep              89                 81             48       85
lung             79                 73             79       75
lym              94                 50             90       93
AVG            85.8               71.5           75.5     85.8
Table 4. Classification accuracy (%) on the engineering datasets

Dataset   Boolean Reasoning   Equal Frequency   Entropy   Naïve
ion              39                 93             76       93
image            20                 71             68       78
cpu             0.7                1.6            0.6      1.7
auto            1.9                1.8            0.6      3.1
AVG            15.4               41.9           36.3     44.0
Figure 1 and Figure 2 present the classification accuracy for various numbers of classes for the medical and engineering datasets, respectively.
[Figure 1 (line chart) omitted: classification accuracy (%) on the y-axis against the number of classes (2, 3, 4) on the x-axis, with one series per method (BoolRea, EqualFreq, Entropy, Naïve, SemiNaïve).]
Figure 1: Classification accuracy obtained for various numbers of classes of the medical datasets
[Figure 2 (line chart) omitted: classification accuracy (%) on the y-axis against the number of classes (2, 7, 8) on the x-axis, with one series per method.]
Figure 2: Classification accuracy obtained for various numbers of classes of the engineering datasets
4. Discussion and Conclusion
In this study, five data discretization methods, namely Boolean Reasoning, Equal Frequency Binning, Entropy, Naïve and Semi Naïve, were tested on two types of datasets, medical and engineering. Standard Voting was chosen as the classifier.
The results in Table 4 show that the engineering datasets achieved less than 50% classification accuracy. In contrast, the results for the medical datasets, shown in Table 3, exhibit more than 70% classification accuracy.
Based on these outputs, Naïve and Boolean Reasoning rank first among the methods used and are the most suitable discretization methods in the medical area; they provided better accuracies than the other three methods. Similarly, for engineering data with a specific class distribution, Naïve, Semi Naïve and Entropy give better results than the others.
From Table 4, the average classification accuracy of each discretization method on the cpu and auto datasets is less than 4%. This experiment therefore suggests that datasets with more than 7 classes may give poor classification accuracy. Figure 1 and Figure 2 support this claim, as they reveal that the classification accuracy decreases as the number of classes increases.
Based on Figure 1, the Equal Frequency and Semi Naïve discretization methods produce lower classification accuracy when a larger number of classes is used. Similarly, Figure 2 reveals that all discretization methods give higher accuracy for small class sizes on the engineering datasets. Thus, the bigger the class size, the lower the classification accuracy.
In general, this experiment has shown that, for the engineering datasets, the number of classes does affect the classification accuracy of all the discretization methods examined. Further study into suitable class sizes for engineering datasets should be conducted. In addition, a study of the effect of different types of datasets should be carried out.
References
(1) Ohrn, A.: Discernibility and Rough Sets in Medicine: Tools and Applications. PhD Thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway, NTNU report 1999:133, IDI report 1999:14, ISBN 82-7984-014-1, 239 pages (1999).
(2) Kurgan, L. and Cios, K.J.: Discretization Algorithm that Uses Class-Attribute Interdependence Maximization. Proceedings of the 2001 International Conference on Artificial Intelligence (IC-AI 2001), pp. 980-987, Las Vegas, Nevada (2001).
(3) Chmielewski, M.R. and Grzymala-Busse, J.W.: Global Discretization of Continuous Attributes as Preprocessing for Machine Learning. International Journal of Approximate Reasoning, Vol. 15, pp. 319-331 (1996).
(4) The University of California: UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
(5) Ratanamahatana, C.A.: CloNI: Clustering of √N-Interval Discretization. Proceedings of the 4th International Conference on Data Mining Including Building Application for CRM & Competitive Intelligence, Rio de Janeiro, Brazil (2003).
(6) Ohrn, A.: ROSETTA Technical Reference Manual. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway (2001).
(7) Dougherty, J., Kohavi, R. and Sahami, M.: Supervised and Unsupervised Discretization of Continuous Features. In Proc. Twelfth International Conference on Machine Learning, Los Altos, CA: Morgan Kaufmann, pp. 194-202 (1995).
(8) Liu, H. et al.: Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 6, 393-423 (2002).
(9) Karthigasoo, S., Cheah, Y.N. and Manickam, S.: Improving the Accuracy of Medical Decision Support via a Knowledge Discovery Pipeline using Ensemble Techniques. Journal of Advancing Information and Management Studies, 2(1) (2005).
(10) Nguyen, H.S. and Skowron, A.: Quantization of Real-Valued Attributes. In Proc. Second International Joint Conference on Information Sciences, pp. 34-37 (1995).