Partition Incremental Discretization
Carlos Pinto
University of Algarve and LIACC
Rua de Ceuta, 118-6
4050 Porto, Portugal
Email: [email protected]

João Gama
LIACC, FEP - University of Porto
Rua de Ceuta, 118-6
4050 Porto, Portugal
Email: [email protected]
Abstract— In this paper we propose a new method to perform incremental discretization. The approach consists of splitting the task into two layers. The first layer receives the sequence of input data and stores statistics of these data, using a higher number of intervals than usually required. The final discretization is generated by the second layer, based on the statistics stored by the first layer. The proposed architecture processes streaming examples in a single scan, in constant time and space, even for infinite sequences of examples. We demonstrate with examples that incremental discretization achieves better results than batch discretization, maintaining the performance of learning algorithms. The proposed method is much more appropriate for evaluating incremental algorithms and for problems where data flows continuously, as in most recent data mining applications.
Index Terms—Artificial Intelligence, Machine Learning,
Pre-Processing, Incremental Discretization.
I. INTRODUCTION
Discretization of continuous attributes is an important task
for certain types of machine learning algorithms. In Bayesian
learning applied to continuous features, discretization is the
most common approach [3]. In our research on incremental
Bayesian learning, we faced a problem: how to evaluate
an incremental tree augmented naive Bayes in continuous
domains? We think that the answer involves incremental
discretization. Although discretization is a well-known topic
in data analysis and machine learning, most of the work
refers to batch discretization and very few studies refer to
incremental discretization [9]. This was the main motivation
for developing this study.
Discretization is a process that divides continuous numeric
values into a set of intervals that can be regarded as discrete
categorical values. These algorithms enable learning schemes that can only handle nominal attributes to process numeric data. Discretization brings several benefits: it accelerates the learning scheme, because nominal attributes are generally processed faster than numeric ones; it reduces the likelihood of overfitting, by narrowing the space of possible hypotheses that the learning scheme can explore; and for this reason it also lowers the chance of finding a complex hypothesis that fits the training data particularly well just by chance. The resulting classifiers are often significantly less complex and sometimes more accurate than classifiers learned from raw numeric data [16]. This is an important issue for many algorithms in machine learning, because some algorithms, such as Bayesian networks, only work with categorical features. One solution in such cases is to discretize the attributes.
For the classification task we use the tree augmented naive Bayes (TAN) algorithm, a Bayesian algorithm presented by Friedman [7]. We use an incremental version of TAN (TANi) to evaluate the proposed incremental discretization. Our implementation of TANi closely follows the incremental TAN presented in [15]. This algorithm achieves a performance similar to batch TAN.
In the rest of the paper we introduce some incremental learning concepts and review some discretization methods (Section II). We then present our incremental discretization method (Section III) and compare incremental discretization with algorithms working with batch discretization (Section IV). The main conclusions and future work are described in Section V.
II. BACKGROUND AND RELATED WORK
A. Incremental Learning
The aim of machine learning is to obtain algorithms that improve their performance through experience. Nowadays there are many practical reasons for the growing interest in incremental learning. Companies from a very wide range of activities store huge amounts of data every day. One-shot algorithms are unable to easily process this huge amount of continuously incoming instances and incorporate it in a knowledge base within a reasonable amount of time and memory space. We believe that in this environment incremental learning becomes particularly relevant, since this sort of algorithm is able to revise already existing models of the data without starting from scratch and without re-processing past data [15].
Discretization of continuous attributes is an important task
for certain types of machine learning algorithms. Bayesian
approaches, for instance, require assumptions about data distributions. Decision Trees require sorting operations for dealing
with continuous attributes, which largely increase learning
times. In the rest of this section we revisit the most used
methods for discretization.
B. Review of discretization methods
Nowadays there are hundreds of discretization methods. In [4], [16], [5], [13], [8] the authors present a large overview of the area. According to [4] we can define three different axes along which discretization methods may be classified:
supervised vs. unsupervised, global vs. local, and static vs. dynamic. Supervised methods use the information of the class labels, while unsupervised methods do not. Local methods, like the one used by C4.5 [14], produce partitions that are applied to localized regions of the instance space. Global methods, such as binning, are applied before the learning process to the whole dataset. In static methods attributes are discretized independently of each other, while dynamic methods take the interdependencies between them into account. Note that in this paper we review only static methods. There is another axis, parametric vs. non-parametric methods: parametric methods need user input, such as the number of cut-points, while non-parametric methods use only information from the data.
Some examples of discretization methods are:
• Equal width discretization (EWD). This is the simplest method to discretize a continuous-valued attribute: it divides the range of observed values for a feature into k equally sized bins. Normally the user gives the number of intervals as input, so it is a parametric method. The main advantage of this global and unsupervised method is its simplicity, but it has a major drawback: its sensitivity to outliers (see the sketch after this list).
• Equal frequency discretization (EFD), like the previous
method, it divides the range of observed values into k
bins where (considering n instances) each bin contains
n/k values. For the same reasons, this method is also
global, unsupervised and parametric.
• k-means. This is a more sophisticated version of binning. In this method the distribution of the values over the k intervals minimizes the intra-interval distance and maximizes the inter-interval distance. It is an iterative method that begins with an equal width discretization and only stops when no value can be moved to improve the criteria above. Like the methods previously reviewed, this is a global, unsupervised and parametric method.
• Recursive entropy discretization (ENT-MDL). This method was presented by Fayyad and Irani [6] in 1993; it uses entropy to select the cut-points, iteratively choosing the best ones. Since the entropy is computed from the class labels, it is a supervised method. It uses the minimum description length as the stopping criterion, so it does not require user intervention and is therefore a non-parametric method. This is a top-down discretization: during its application a tree of discretizations is created. In the original paper it was applied locally at each node during tree generation, but some authors have applied the method as a global discretization, achieving good results [4].
• Proportional discretization (PD). This method was presented by Ying Yang; it calculates the number of intervals and their frequency with a very simple formula, s × t = n, where t is the number of intervals, s is the frequency in each interval and n is the number of training instances. It was created in the context of naive Bayes; the author argues that setting the number of intervals and the frequency proportional to the amount of training data reduces both bias and variance (for more information consult [16]). This is also a more sophisticated version of binning; it uses no class information and applies to the whole data set, so it is a global, unsupervised and non-parametric method.
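To make the two simplest unsupervised methods concrete, the following sketch computes EWD and EFD cut-points for a batch of observed values. It is our own illustration rather than code from the paper; the function names are arbitrary and NumPy is assumed to be available.

import numpy as np

def equal_width_cuts(values, k):
    # Equal width discretization (EWD): k bins spanning equal ranges.
    lo, hi = float(np.min(values)), float(np.max(values))
    # k - 1 interior cut-points split [lo, hi] into k equal-width bins
    return list(np.linspace(lo, hi, k + 1)[1:-1])

def equal_frequency_cuts(values, k):
    # Equal frequency discretization (EFD): k bins with about n/k values each.
    qs = np.linspace(0, 1, k + 1)[1:-1]          # interior quantile levels
    return list(np.quantile(values, qs))

# Example with 12 observed values and k = 3 bins
vals = [1, 2, 2, 3, 4, 4, 5, 7, 8, 9, 9, 30]
print(equal_width_cuts(vals, 3))       # the outlier 30 stretches the equal-width bins
print(equal_frequency_cuts(vals, 3))   # each bin keeps roughly 4 values

The outlier in the example shows EWD's drawback mentioned above: a single extreme value pushes most observations into one bin, while EFD is unaffected.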
III. OUR PROPOSAL FOR INCREMENTAL DISCRETIZATION
As previously mentioned, many algorithms developed in the machine learning community focus on learning with categorical attributes [11], [12]. However, many real-world problems involve continuous features, to which such algorithms cannot be applied unless the continuous features are discretized.
Fig. 1. Example of discretization in two layers with an unsupervised method. From a dataset we obtain a very fine discretization, and from that layer we obtain the final discretization; this illustrates the principle of PiD.
We will now introduce a new discretization method that can perform incremental discretization. With this approach the discretization can be reformulated when new data arrives. The main problem we address is: how to update a given discretization when new data becomes available, without storing all the data seen so far?
The Partition Incremental Discretization algorithm (PiD for short) is split into two layers: the first layer simplifies and summarizes the data; the second layer performs the final discretization. PiD works online, processing each example once, and it can process infinite sequences of data in constant time and space. It can work in two modes, supervised and unsupervised: the former uses the information about the class label of the examples, while the latter does not require this information.
Each layer must be initialized, and updated when new data
is available. We will now describe each layer:
The first layer is defined by a fixed number of partitions. In each partition PiD maintains a counter with the number of examples whose attribute value falls in that partition. If the method is applied in a supervised way, it keeps one counter per class per partition. The input for this layer is the raw data.
The second layer provides the final set of intervals. It uses a discretization method selected by the user. If this discretization method requires the number of intervals as input, the user must also define this number. The input for this layer is the set of intervals of the first layer.
A. Initialization of the layers
First layer: the number of intervals in this layer should be much higher than the number finally required. It can be initialized in two modes:
• Without seeing any previous data. We use an EWD strategy to generate the initial partitions; the range of each continuous variable must be known.
• With some initial data. The layer can use EWD or EFD to generate the partitions.
When the second layer runs for the first time, it is given the frequency distribution of the first layer and the discretization method to be used. On this first run, the number of cut-points is decided and the discretization to be used on the data is generated.
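As a concrete picture of the first-layer initialization just described, the sketch below (ours, not code from the paper) creates equal-width partitions over a known attribute range and, in supervised mode, keeps one counter per class per partition. The class PiDLayer1 and the class labels are illustrative assumptions.

import numpy as np

class PiDLayer1:
    # Sketch of PiD's first layer: fixed partitions with frequency counters.
    def __init__(self, lo, hi, n_partitions, classes=None):
        # equal-width partition limits over the known range [lo, hi]
        self.edges = np.linspace(lo, hi, n_partitions + 1)
        self.classes = list(classes) if classes is not None else None
        if self.classes is None:
            # unsupervised mode: one counter per partition
            self.counts = np.zeros(n_partitions, dtype=int)
        else:
            # supervised mode: one counter per class per partition
            self.counts = np.zeros((len(self.classes), n_partitions), dtype=int)

# Example: 10 partitions for an attribute ranging from 0.0 to 10.0
layer1 = PiDLayer1(0.0, 10.0, 10, classes=["setosa", "versicolor", "virginica"])
print(layer1.edges)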
B. Updating the layers
When new data arrives, the first layer analyzes the data and updates the frequency counts of the partitions. This can be done on the fly, using EWD; the limits of the intervals are fixed.
After the first layer is updated, the second layer reformulates the final discretization based on the new frequency distribution of the attribute values. This can be done either by restarting the discretization or by reformulating the previous discretization. After the new discretization is obtained, it is applied to the new data, and it is this discretized data that is passed to the incremental algorithm or agent.
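A minimal sketch of this update step, under the same illustrative assumptions as the previous snippet: a new value only increments the counter of the fixed partition it falls into, and the second layer's equal-frequency cut-points are recomputed from the first layer's cumulative counts rather than from the raw data.

import numpy as np

def update_layer1(edges, counts, value):
    # Increment the counter of the fixed partition that contains value.
    idx = int(np.searchsorted(edges, value, side="right")) - 1
    idx = min(max(idx, 0), len(counts) - 1)      # clamp values outside the range
    counts[idx] += 1

def layer2_equal_frequency(edges, counts, k):
    # Pick k - 1 second-layer cut-points from the first-layer limits so that
    # the final intervals hold counts as equal as those limits allow.
    cum = np.cumsum(counts)
    total = cum[-1]
    cuts = []
    for j in range(1, k):
        target = j * total / k
        pos = int(np.searchsorted(cum, target))  # first partition reaching the target
        cuts.append(edges[pos + 1])              # use that partition's upper limit
    return cuts

# Example: 10 fixed partitions over [0, 10] and a small stream of values
edges = np.linspace(0.0, 10.0, 11)
counts = np.zeros(10, dtype=int)
for v in [0.5, 1.2, 2.7, 3.1, 3.3, 4.8, 5.5, 6.1, 8.9, 9.4]:
    update_layer1(edges, counts, v)
print(layer2_equal_frequency(edges, counts, 3))  # two cut-points for three intervals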
Roughly speaking, the first layer summarizes the data seen so far and the second layer discretizes the data. As the partition frequencies in the first layer increase, the second layer has more information available, which gives more confidence to the final discretization. PiD can be used with any learning algorithm; it is simply a discretization method that can discretize data arriving in chunks.
The principal difference between PiD and the methods reviewed above is that in PiD the final discretization is obtained from a previous discretization and some statistics (the first layer). This makes it possible to reformulate the final discretization (the second layer) when new data arrives.
This method can also converge to batch discretization. This characteristic depends on the number of intervals of the first layer: if the number of intervals in the first layer tended to infinity, the distribution stored in the first layer would be like a vector of points, and in that case the second layer would be based on the whole data set. What we attempt to show in the experimental tests is that we can relax this without a significant degradation of the learning algorithm's performance. An example may make the method clearer. In Figure 2 we can observe an example of PiD in three steps with equal frequency discretization on both layers.
Fig. 2. An example of PiD running with equal frequency discretization on the second layer. The intervals with numbers from 0 to 9 are the cut-points of the first layer, and the vertical bars are the cut-points of the second layer. With EFD the algorithm tries to maintain the frequency in the final discretization by changing the cut-points of the second layer.
(a) PiD generates a discretization with ten intervals for the first layer. After this, the method generates three intervals based on the first layer, where each interval contains approximately six examples. The 3 intervals of the second layer are: 0-3; 3-6; and 6-10.
(b) New data arrives, and PiD updates the frequency in each partition of the first layer. The discretization method updates the cut-points of the second layer so that each interval contains approximately eight examples. Now, the 3 intervals are: 0-2; 2-6; and 6-10.
(c) New data arrives and PiD updates the frequency in each partition of the first layer; the method then updates the cut-points of the second layer so that each interval contains approximately ten examples. The 3 intervals are now: 0-2; 2-7; and 7-10.
In this example, an instance with an attribute-value of 6.5 would be assigned to the third interval after seeing the first and second chunks of data. After the third chunk, the same example would be assigned to the second interval.
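To see how an example is relabelled when the second-layer cut-points move, a one-line lookup is enough. This snippet is ours and simply reuses the cut-points quoted in the text, expressed on the first-layer scale.

import bisect

def interval_of(value, cuts):
    # 0-based index of the final interval that contains value
    return bisect.bisect_right(cuts, value)

print(interval_of(6.5, [2, 6]))   # -> 2, the third interval (after the second chunk)
print(interval_of(6.5, [2, 7]))   # -> 1, the second interval (after the third chunk)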
The Iris data set is used as an example to illustrate the results of different discretization methods. It contains 150 instances with four continuous features and three class labels. We applied PiD with chunks of thirty instances. The resulting discretizations are described in Table III. The first part shows the cut-points obtained in the first layer. The other two parts show the intervals obtained by supervised and unsupervised discretization. The supervised method used in the second layer was recursive entropy discretization; the unsupervised method was equal frequency. This example shows how the limits of the intervals are flexible and change as the data demands.
IV. EXPERIMENTAL EVALUATION
A. Methodology
All datasets are from the U. C. Irvine repository [2]. The
accuracy of each classifier is computed as the percentage of
successful predictions on the test set.
TABLE I
SUMMARY OF RESULTS OF THE WILCOXON TEST WITH SIGNIFICANCE LEVEL OF 5%.

                             Supervised                 Unsupervised
Incremental discretization   wins   equals   loses      wins   equals   loses
vs. batch                    0      14       4          2      12       4
vs. initial                  3      14       1          4      12       2
Accuracy was evaluated using 10-fold cross validation. All algorithms generate a model on the same training sets and the models are evaluated on the same test sets. The following pseudo-algorithms illustrate the differences between the incremental and batch modes:
Evaluation in batch mode:
1- Discretize the training data.
2- Discretize test set using intervals obtained from point 1.
3- Learn a model using discretized training data.
4- Evaluate the model on the discretized test data.
Evaluation in incremental mode:
1- For each chunk of training examples E
2- Discretize E using PiD: update the current set of intervals used by PiD and obtain the discretized set of examples E_discretized.
3- Update the incremental learning model with E_discretized.
4- Next chunk
5- Discretize the test set using the current set of intervals of PiD.
6- Evaluate the current learning model on the discretized test set.
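The incremental protocol above can be written in a few lines of Python. This is only our illustration of the loop: pid.update, pid.transform, learner.partial_fit and learner.predict are hypothetical interfaces standing in for the PiD discretizer and the TANi learner used in the paper.

def evaluate_incremental(pid, learner, train_chunks, test_X, test_y):
    # Sketch of the incremental evaluation protocol described above.
    for chunk_X, chunk_y in train_chunks:            # 1- for each chunk E
        pid.update(chunk_X, chunk_y)                 # 2- update PiD's intervals ...
        E_discretized = pid.transform(chunk_X)       #    ... and discretize the chunk
        learner.partial_fit(E_discretized, chunk_y)  # 3- update the incremental model
    test_discretized = pid.transform(test_X)         # 5- discretize the test set
    predictions = learner.predict(test_discretized)  # 6- evaluate the current model
    return sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)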
We have evaluated two situations to initialize the first layer: one without seeing any data, using equal width discretization, and another after seeing some data, using equal frequency. Due to lack of space we only present the results for equal frequency. Since this is a parametric method, to find the number of cut-points we used the heuristic: number of intervals = l/∜l, where l is the number of instances in the first chunk of data. The minimum number of partitions for the first layer was set to 10. We carried out experiments with supervised and unsupervised discretization in the second layer. The minimum number of intervals was set to two.
The supervised discretization method chosen was recursive entropy discretization. This method was described above; more information about it can be found in [6]. In the initialization phase the stopping criterion used was the minimum description length. The initial discretization defines the number of intervals of the second layer¹. When more data is available the intervals can grow or shrink. The unsupervised discretization method used was equal frequency. Since this method is parametric, to find the number of intervals we used the heuristic: number of intervals = ∜l, where l is the number of instances in the first chunk of data. In the datasets used in this paper this choice works well.
In all the experiments the learning algorithms used are a batch TAN and its incremental version, TANi.
¹ In the experiments reported here, the number of intervals is defined by the first set of data and never changes when new data is available. This is a restriction imposed by the learning algorithm we use: like most learning algorithms, it requires that the number of possible values of a discrete attribute is known and fixed in advance.
B. Comparison between supervised and unsupervised incremental discretization
Table IV presents the results of batch, initial and incremental discretization, for both supervised and unsupervised discretization. The first three columns present the results for unsupervised discretization and the last three the results for supervised discretization. Supervised discretization takes advantage over unsupervised discretization in both the batch and incremental versions. For comparative purposes we also present the results of a discretization built using only the first chunk of data. It is interesting to observe that, with this initial discretization, unsupervised has some advantages over supervised. This can be due to the fact that supervised discretization needs more examples than its unsupervised version to perform a good discretization.
C. Comparison between batch and incremental discretizations
In Table IV we can analyze the differences in performance between the batch discretization, the initial discretization (built with only 10% of the data) and the incremental discretization. Table I presents a summary of the significance of the differences, using the Wilcoxon test. We can see the importance of the discretization: it is clear that a batch discretization outperforms an initial discretization, but with an incremental discretization it is possible to improve on the results of an initial discretization. The average error over all datasets shows that the incremental discretization improves the accuracy of the same algorithm using the initial discretization. From these results we can conclude that batch discretization is an optimistic method to evaluate incremental discretization; PiD is able to approach batch discretization. In these datasets, most differences are not significant.
D. Varying the number of examples per chunk
Table II presents the results of TANi with incremental discretization for different chunk sizes. The size of the chunks varies from 10% of the data down to 1%. We would like to emphasize that building a model and a discretization with 1% of the data is extremely difficult; yet even in these rough conditions the results are close to those of the batch algorithm. The accuracy improves when the number of instances per chunk grows, an expected result. Nevertheless, as we have pointed out before, PiD depends on the initial discretization: when the first chunk of instances is larger, it constitutes a better base. A set of experiments, not reported here due to lack of space, indicated that PiD seems to be resilient to the order in which the examples are given.
E. A real-world Application
In the dataset TrawlHauls (Fig. 3) the goal is to predict whether a trawl vessel is fishing or not from GPS information [1].
TABLE II
EXPERIMENTAL RESULTS OF BATCH ALGORITHMS AND PID WITH DIFFERENT SIZES OF CHUNKS

Discretization:     Supervised    Supervised    Supervised    Supervised    Supervised    Supervised
                    Batch         Batch         Incremental   Incremental   Incremental   Incremental
                    TAN           TANi          TANi          TANi          TANi          TANi
Data set                          Chunks 10%    Chunks 10%    Chunks 5%     Chunks 2%     Chunks 1%
Australian          87.10±4.12    86.23±4.11    85.80±4.47    85.65±2.69    85.36±4.85    82.17±8.14
Pima                76.44±4.25    75.26±4.09    74.98±6.38    72.52±3.45    74.08±4.36    73.18±5.84
Diabetes            75.01±2.65    75.13±3.23    73.83±3.86    73.70±4.58    73.44±4.87    73.57±3.57
Tokyo               91.55±2.43    92.28±1.50    91.34±2.16    92.39±2.31    88.11±4.09    89.05±3.87
German              70.70±3.59    71.50±3.81    71.88±4.98    72.10±3.75    71.40±3.60    72.90±4.86
Segmentation        94.98±1.45    94.68±1.37    93.90±1.56    92.38±2.69    91.39±2.01    91.73±1.36
Waveform            80.68±1.41    80.80±2.02    80.04±1.34    77.92±1.99    77.40±1.51    77.42±1.42
Churn               89.02±1.92    87.04±1.44    86.84±1.73    86.02±1.21    86.36±1.49    86.28±1.23
Satellite Image     87.65±1.30    87.89±1.24    85.86±1.67    86.22±0.96    85.77±1.43    84.04±1.53
Adult               83.37±0.56    85.66±0.54    84.72±0.42    84.29±0.49    84.31±0.52    84.23±0.57
Shuttle             99.93±0.04    99.86±0.07    99.78±0.08    99.77±0.07    99.69±0.07    99.63±0.12
Average             85.07         85.05         84.45         83.91         83.39         83.11
TABLE III
RESULTS OF SUPERVISED AND UNSUPERVISED DISCRETIZATION WITH THE IRIS DATA SET

First layer with equal frequency discretization
Feature   Cut Points                                                      Num. Points
Att1      4.95;5.15;5.35;5.45;5.55;5.65;5.75;5.86;5.95;6.25;6.40;6.80     12
Att2      2.35;2.45;2.60;2.75;2.95;3.05;3.15;3.25;3.35;3.45;3.60;3.75     12
Att3      1.35;1.45;1.55;2.60;3.80;4.10;4.30;4.55;4.80;5.05;5.45;5.65     12
Att4      0.25;0.75;1.05;1.15;1.20;1.25;1.35;1.45;1.65;1.75;1.95;2.15     12

Second layer with equal frequency discretization (unsupervised discretization)
          Cut points after
Feature   30 instances  60 instances  90 instances  120 instances  150 instances  Num. of Intervals
Att1      5.55          5.65          5.65          5.75           5.75           2
Att2      2.95          2.95          2.95          2.95           2.95           2
Att3      4.30          4.10          4.10          4.10           4.30           2
Att4      1.25          1.25          1.25          1.25           1.25           2

Second layer with recursive entropy discretization (supervised discretization)
          Cut points after
Feature   30 instances  60 instances  90 instances  120 instances  150 instances  Num. of Intervals
Att1      5.45          5.55          5.86          5.45           5.55           2
Att2      2.75          3.25          3.25          3.25           3.05           2
Att3      2.60; 5.05    2.60; 5.05    2.60; 5.05    2.60; 4.80     2.60; 4.80     3
Att4      0.75; 1.65    0.75; 1.65    0.75; 1.65    0.75; 1.75     0.75; 1.75     3
The attributes are derived from a window of 11 consecutive points (five before the decision point and five after): 11 attributes are the direction and the other 11 the speed of the vessel. This is a two-class problem defined by 22 attributes. From all the available data, 675,329 examples, we randomly selected 10% for testing and the remainder (607,886 examples) was used for training. Using this training set and test set, the errors of the 3 PiD discretization methods using a naive Bayes classifier are: Entropy: 6.86%, Equal-width: 6.91%, and Equal-frequency: 6.93%. For comparative purposes, the error of updatable naive Bayes in WEKA [10] is 9.11%².
² Other variants, including the discretization procedures implemented in Weka, did not run due to lack of memory.
V. CONCLUSION AND FUTURE WORK
In this paper we present a new method for incremental discretization. Although incremental learning is a hot topic in machine learning and discretization is a fundamental pre-processing step for some well-known algorithms, incremental discretization has received little attention from the community. We have introduced a new discretization method that works in two layers. This two-stage architecture is very flexible: it can be used in supervised or unsupervised mode, and any base discretization method can be used for the second layer. The most relevant aspect is that the boundaries of the intervals of the second layer can change when new data is available. We also note that this method can save memory and computation time: since discretization simplifies and reduces data storage, a first layer with equal frequency used as preparation for a second layer with more complex methods, like recursive entropy discretization or chi-merge, can be faster than methods that are applied to the whole dataset. The main advantage of the proposed method is the ability to process training examples in a single scan over the data. PiD processes examples in constant time and space, even for infinite sequences of streaming data.
Fig. 3. Illustrative figure for the 'trawl haul' problem. The bottom panel shows the speed of the vessel and the predicted trawl hauls. The blue lines show the predictions and the black lines show the trawl hauls identified by the user.
ACKNOWLEDGEMENTS
This work was developed under project Adaptive Learning
Systems II (POSI/EIA/55340/2004).
REFERENCES
[1] M. Afonso-Dias, J. Simoes, and C. Pinto. A dedicated gis to estimate and
map fishing effort and landings for the portuguese crustacean trawl fleet.
In T. Nishida, P. J. Kaiola, and C. E. Hollingworth, editors, GIS/Spatial
Analyses in Fishery and Aquatic Sciences, volume 2, pages 323–340.
Fishery-Aquatic GIS Research Group, Saitama, Japan, 2004.
[2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases,
1998.
[3] P. Domingos and M. J. Pazzani. On the optimality of the simple bayesian
classifier under zero-one loss. Machine Learning, 29(2-3):103–130,
1997.
[4] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised
discretization of continuous features. In Proceedings 12th International
Conference on Machine Learning, pages 194–202. Morgan and Kaufmann, 1995.
[5] Tapio Elomaa and Juho Rousu. Necessary and sufficient pre-processing in numerical range discretization. Knowledge and Information Systems, 5:162–182, 2003.
[6] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In 13th International Joint Conference on Artificial Intelligence, pages 1022–1027. Morgan Kaufmann, 1993.
[7] N. Friedman and M. Goldszmidt. Building classifiers using bayesian
networks. In AAAI/IAAI, Vol. 2, pages 1277–1284, 1996.
[8] Raul Giraldez, Jesus S. Aguilar-Ruiz, Jose C. Riquelme, Francisco J.
Ferrer-Troyano, and Domingo S. Rodriguez-Baena. Discretization oriented to decision rule generation. In 6th International Conference on
Knowledge-Based Intelligent Information Engineering Systems, pages
275–279. IOS Press, 2002.
[9] S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In
STOC ’01: Proceedings of the thirty-third annual ACM symposium on
Theory of computing, pages 471–475. ACM Press, 2001.
[10] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, Inc., 1999.
[11] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, pages 1137–1145. Morgan Kaufmann, 1995.
[12] R. Kohavi and M. Sahami. Error-based and entropy-based discretization of continuous features. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 114–119, 1996.
[13] M. J. Pazzani. An iterative improvement approach for the discretization of numeric attributes in Bayesian classifiers. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 1995.
[14] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1993.
[15] J. Roure. Incremental Methods for Bayesian Network Structure Learning. PhD thesis, Universidad Politécnica de Cataluña, 2004.
[16] Y. Yang. Discretization for Naive-Bayes Learning. PhD thesis, School of Computer Science and Software Engineering, Monash University, July 2003.
TABLE IV
EXPERIMENTAL RESULTS OF BATCH ALGORITHMS AND PID WITH SUPERVISED AND UNSUPERVISED INCREMENTAL DISCRETIZATION METHOD

Discretization:     Unsupervised  Unsupervised  Unsupervised  Supervised    Supervised    Supervised
                    Batch         Initial       Incremental   Batch         Initial       Incremental
                    TANi          TANi          TANi          TANi          TANi          TANi
Data Set                          Chunks 10%    Chunks 10%                  Chunks 10%    Chunks 10%
Australian          86.52±4.04    86.81±3.71    85.65±3.64    86.23±4.11    85.94±3.81    85.80±4.47
Pima                74.75±3.33    72.52±4.02    72.39±4.54    75.26±4.09    71.99±4.17    74.98±6.38
Diabetes            75.79±4.33    73.18±2.98    72.39±3.89    75.13±3.23    73.05±5.06    73.83±3.86
Tokyo               90.82±2.24    90.93±2.86    91.35±2.82    92.28±1.50    90.09±2.62    91.34±2.16
German              71.80±3.16    72.30±4.14    72.70±3.47    71.50±3.81    72.30±4.14    71.88±4.98
Segmentation        92.21±1.31    90.74±2.31    89.65±1.56    94.68±1.37    77.49±1.80    93.90±1.56
Waveform            79.28±1.25    79.68±1.91    81.10±1.49    80.80±2.02    79.68±1.91    80.04±1.34
Churn               84.46±1.46    85.86±1.12    83.94±1.42    87.04±1.44    86.6±1.35     86.84±1.73
Satellite Image     86.81±1.32    85.75±1.10    89.65±1.56    87.89±1.24    86.73±1.60    85.86±1.67
Adult               83.52±0.55    83.30±0.65    83.32±0.62    85.66±0.54    84.74±0.43    84.72±0.42
Shuttle             99.68±0.05    99.31±0.13    98.82±0.09    99.86±0.08    99.80±0.05    99.78±0.08
Average of accuracy 84.36         83.09         83.29         85.14         82.81         84.37
(averages are over the 18 data sets of the study, which also include BreastLoss, Cars, Cleve, CRX, Heart, Liver-disorder and Wine)