Partition Incremental Discretization

Carlos Pinto
University of Algarve and LIACC
Rua de Ceuta, 118-6, 4050 Porto, Portugal
Email: [email protected]

João Gama
LIACC, FEP - University of Porto
Rua de Ceuta, 118-6, 4050 Porto, Portugal
Email: [email protected]

Abstract— In this paper we propose a new method to perform incremental discretization. The approach splits the task into two layers. The first layer receives the sequence of input data and stores statistics of this data, using a higher number of intervals than is usually required. The second layer generates the final discretization from the statistics stored by the first layer. The proposed architecture processes streaming examples in a single scan, in constant time and space, even for infinite sequences of examples. We demonstrate with examples that incremental discretization achieves better results than batch discretization, maintaining the performance of learning algorithms. The proposed method is much more appropriate for evaluating incremental algorithms, and for problems where data flows continuously, as in most recent data mining applications.

Index Terms—Artificial Intelligence, Machine Learning, Pre-Processing, Incremental Discretization.

I. INTRODUCTION

Discretization of continuous attributes is an important task for certain types of machine learning algorithms. In Bayesian learning applied to continuous features, discretization is the most common approach [3]. In our research on incremental Bayesian learning we faced a problem: how to evaluate an incremental tree augmented naive Bayes in continuous domains? We think the answer involves incremental discretization. Although discretization is a well-known topic in data analysis and machine learning, most of the work refers to batch discretization and very few studies address incremental discretization [9]. This was the main motivation for the development of this study.

Discretization is a process that divides continuous numeric values into a set of intervals that can be regarded as discrete categorical values. It enables learning schemes that can only handle nominal attributes to process numeric data. Discretization brings several benefits: it accelerates the learning scheme, because nominal attributes are generally processed faster than numeric ones; it reduces the likelihood of overfitting, by narrowing the space of possible hypotheses the learning scheme can explore; for the same reason it also lowers the chance of finding a complex hypothesis that fits the training data particularly well just by chance. The resulting classifiers are often significantly less complex, and sometimes more accurate, than classifiers learned from raw numeric data [16]. This is an important issue for many algorithms in machine learning, because some algorithms, such as Bayesian networks, only work with categorical features. One solution is to discretize the continuous attributes.

For the classification task we use a tree augmented naive Bayes (TAN) algorithm. This is a Bayesian algorithm presented by Friedman [7]. We use an incremental version of TAN (TANi) to evaluate the proposed incremental discretization. Our implementation of TANi closely follows the incremental TAN presented in [15]. This algorithm achieves performance similar to batch TAN.

In the rest of the paper we introduce some incremental learning concepts, followed by a review of some discretization methods (Section 2).
We then present our incremental discretization method (Section 3) and compare incremental discretization with algorithms working with batch discretization (Section 4). The main conclusions and future work are described in Section 5.

II. BACKGROUND AND RELATED WORK

A. Incremental Learning

The aim of machine learning is to obtain algorithms that improve their performance through experience. Nowadays there are many practical reasons that increase the interest in incremental learning. Companies from a very wide range of activities store huge amounts of data every day. One-shot algorithms are unable to process this huge amount of continuously incoming instances and incorporate it into a knowledge base in a reasonable amount of time and memory space. We believe that in this environment incremental learning becomes particularly relevant, since such algorithms are able to revise already existing models of the data without starting from scratch and without re-processing past data [15].

Discretization of continuous attributes is an important task for certain types of machine learning algorithms. Bayesian approaches, for instance, require assumptions about data distributions. Decision trees require sorting operations for dealing with continuous attributes, which largely increase learning times. In the rest of this section we revisit the most used methods for discretization.

B. Review of discretization methods

Nowadays there are hundreds of discretization methods. In [4], [16], [5], [13], [8] the authors present a large overview of the area. According to [4] we can define three different axes along which discretization methods may be classified: supervised vs. unsupervised, global vs. local, and static vs. dynamic. Supervised methods use the information of class labels while unsupervised methods do not. Local methods, like the one used by C4.5 [14], produce partitions that are applied to localized regions of the instance space. Global methods, such as binning, are applied to the whole dataset before the learning process. In static methods attributes are discretized independently of each other, while dynamic methods take into account the interdependencies between them. Note that in this paper we review only static methods. There is another axis, parametric vs. non-parametric methods: parametric methods need user input, such as the number of cut-points, while non-parametric methods only use information from the data. Some examples of discretization methods are listed next (a code sketch of the two simplest ones, EWD and EFD, follows the list):

• Equal width discretization (EWD). This is the simplest method to discretize a continuous-valued attribute; it divides the range of observed values for a feature into k equally sized bins. Normally the user gives the number of intervals as input, so it is a parametric method. The main advantage of this global and unsupervised method is its simplicity, but it has a major drawback: its sensitivity to outliers.

• Equal frequency discretization (EFD). Like the previous method, it divides the range of observed values into k bins where (considering n instances) each bin contains n/k values. For the same reasons, this method is also global, unsupervised and parametric.
• k-means. This is a more sophisticated version of binning. In this method the distribution of the values over the k intervals minimizes the intra-interval distance and maximizes the inter-interval distance. It is an iterative method that begins with an equal width discretization and only stops when no value can be moved to improve the criteria above. Like the methods previously reviewed, it is a global, unsupervised and parametric method.

• Recursive entropy discretization (ENT-MDL). This method was presented by Fayyad and Irani [6] in 1993. It uses entropy to iteratively select the best cut-points; since the entropy is computed from the class labels, it is a supervised method. It uses the minimum description length as the stopping criterion, and therefore does not require user intervention, so it is a non-parametric method. This is a top-down discretization: during its application a tree of discretizations is created. In the original paper it was applied locally at each node during tree generation, but some authors have applied the method as a global discretization, achieving good results [4].

• Proportional discretization (PD). This method was presented by Ying Yang. It calculates the number of intervals and their frequency with a very simple formula, s × t = n, where t is the number of intervals, s is the frequency in each interval and n the number of training instances (setting s = t gives s = t = √n). The method was created in the context of naive Bayes; the author argues that setting the number of intervals and the frequency proportional to the amount of training data reduces both bias and variance (for more information consult [16]). This is also a more sophisticated version of binning; it uses no class information and applies to the whole dataset, so it is a global, unsupervised and non-parametric method.
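As a concrete illustration of the two simplest schemes above, the following sketch computes EWD and EFD cut-points for a batch of values. This is our own illustrative code (the function names and the use of numpy are our choices), not an implementation from the paper.

```python
import numpy as np

def equal_width_cutpoints(values, k):
    """EWD: k equally sized bins over the observed range -> k-1 cut-points.
    Sensitive to outliers, since they stretch the observed range."""
    lo, hi = float(np.min(values)), float(np.max(values))
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cutpoints(values, k):
    """EFD: k bins with approximately n/k values each -> k-1 cut-points."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    n = len(sorted_vals)
    return [sorted_vals[int(i * n / k)] for i in range(1, k)]

# Example: both methods with k = 4 bins on a small sample.
sample = [0.1, 0.2, 0.2, 0.3, 1.0, 5.0, 5.1, 9.8]
print(equal_width_cutpoints(sample, 4))      # evenly spaced over [0.1, 9.8]
print(equal_frequency_cutpoints(sample, 4))  # roughly 2 values per bin
```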
III. OUR PROPOSAL FOR INCREMENTAL DISCRETIZATION

As previously mentioned, many algorithms developed in the machine learning community focus on learning with categorical attributes [11], [12]. However, many real-world problems involve continuous features, where such algorithms cannot be applied unless the continuous features are discretized.

Fig. 1. Example of discretization in two layers with an unsupervised method. From the dataset we obtain a very fine discretization, and from that layer we obtain the final discretization; this illustrates the principle of PiD.

We now introduce a new discretization method that performs incremental discretization. With this approach the discretization can be reformulated when new data arrives. The main problem we address is: how to update a given discretization when new data becomes available, without storing all the data seen so far?

The Partition Incremental Discretization algorithm (PiD for short) is split into two layers: the first layer simplifies and summarizes the data; the second layer performs the final discretization. It works online, processing each example once, and can process infinite sequences of data in constant time and space. PiD can work in two modes: supervised and unsupervised. The former uses the information about the class label of the examples, while the latter does not require this information. Each layer must be initialized, and updated when new data is available. We now describe each layer:

First layer: it is defined by a fixed number of partitions. In each partition PiD maintains a counter with the number of examples whose attribute value falls in that partition. If the method is applied in supervised mode, it keeps a counter per class per partition. The input for this layer is the raw data.

Second layer: it provides the final set of intervals. It uses a discretization method selected by the user. If this discretization method requires the number of intervals as input, the user must also define this number. The input for this layer is the set of intervals of the first layer.

A. Initialization of the layers

First layer: the number of intervals in this layer should be much higher than the number required for the final discretization. It can be initialized in two modes:
• Without seeing any previous data: we use an EWD strategy to generate the initial partitions. The range of all the continuous variables is needed.
• Initialized with some data: the layer can use EWD or EFD to generate the partitions.

Second layer: when the second layer runs for the first time, it is given the frequency distribution of the first layer and the discretization method to be used. On this first run, the number of cut-points is decided and the discretization to be applied to the data is generated.

B. Updating the layers

When new data arrives, the first layer analyzes the data and updates the frequency in each partition. This can be done on the fly, using EWD; the limits of the intervals are fixed. After the first layer is updated, the second layer reformulates the final discretization based on the new frequency distribution of the attribute values. This can be done either by restarting the discretization or by reformulating the previous discretization. After obtaining the new discretization, it is applied to the new data, and it is this discretized data that is passed to the incremental algorithm or agent.

Roughly speaking, the first layer summarizes the data seen so far and the second layer discretizes the data. The discretization of the second layer gains information as the partition frequencies in the first layer increase, giving more confidence to the final discretization. PiD can be used with any algorithm: it is only a discretization method that can accomplish the discretization of data in chunks. The principal difference between PiD and the methods reviewed above is that in PiD the final discretization is obtained from a previous discretization and some statistics (the first layer). This allows the final discretization (second layer) to be reformulated when new data arrives.

This method can also converge to batch discretization. This characteristic depends on the number of intervals of the first layer. If the number of intervals in the first layer tended to infinity, the distribution stored in the first layer would be like a vector of points, and the second layer would then be based on the whole dataset. What we attempt to show in the experimental tests is that we can relax this without a significant degradation of the learning algorithm.
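To make the two-layer idea concrete, the following is a minimal sketch in our own notation, assuming a fixed equal-width first layer with counters and an equal-frequency second layer recomputed from those counters. The class name, parameters and simplifications are our assumptions, not the authors' implementation: it handles a single attribute and omits the supervised per-class counters.

```python
import numpy as np

class PiDSketch:
    """Minimal two-layer discretizer: fixed fine EWD bins with counters (layer 1),
    equal-frequency cut-points recomputed from those counters (layer 2)."""

    def __init__(self, lo, hi, n1=100, n2=5):
        self.edges1 = np.linspace(lo, hi, n1 + 1)   # fixed layer-1 partition limits
        self.counts = np.zeros(n1)                   # examples seen per partition
        self.n2 = n2                                 # intervals of the final discretization

    def update(self, values):
        """Layer 1: update the partition counters with a new chunk of values."""
        idx = np.clip(np.searchsorted(self.edges1, values) - 1, 0, len(self.counts) - 1)
        np.add.at(self.counts, idx, 1)

    def layer2_cutpoints(self):
        """Layer 2: equal-frequency cut-points computed only from layer-1 counts."""
        total = self.counts.sum()
        cum = np.cumsum(self.counts)
        cuts = []
        for j in range(1, self.n2):
            target = j * total / self.n2
            k = int(np.searchsorted(cum, target))    # first partition reaching the target mass
            cuts.append(self.edges1[k + 1])          # use that partition's upper limit
        return cuts

    def discretize(self, values):
        """Map values to final intervals using the current layer-2 cut-points."""
        return np.searchsorted(self.layer2_cutpoints(), values)

# Usage: process data in chunks; the cut-points move as the counters accumulate.
pid = PiDSketch(lo=0.0, hi=10.0, n1=50, n2=3)
for chunk in (np.random.rand(30) * 4, np.random.rand(30) * 10):
    pid.update(chunk)
    print(pid.layer2_cutpoints(), pid.discretize(chunk)[:5])
```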
An example may make the method clearer. Figure 2 shows PiD in three steps with equal frequency discretization on both layers.

Fig. 2. An example of PiD running with equal frequency discretization on the second layer. The intervals numbered from 0 to 9 are the cut-points of the first layer, and the vertical bars are the cut-points of the second layer. With EFD the algorithm tries to maintain the frequency in the final discretization by changing the cut-points of the second layer.

(a) PiD generates a discretization with ten intervals for the first layer. After this, the method generates three intervals based on the first layer, where each interval contains approximately six examples. The 3 intervals of the second layer are: 0-3, 3-6 and 6-10.
(b) New data arrives, and PiD updates the frequency in each partition of the first layer. The discretization method updates the cut-points of the second layer so that each interval contains approximately eight examples. Now the 3 intervals are: 0-2, 2-6 and 6-10.
(c) New data arrives and PiD updates the frequency in each partition of the first layer; the method then updates the cut-points of the second layer so that each interval contains approximately ten examples. The 3 intervals are now: 0-2, 2-7 and 7-10.

In this example, an instance with an attribute value of 6.5 would be labelled with the third interval after seeing the first and second chunks of data. After the third chunk, the same example would be labelled with the second interval.

The Iris dataset is used to illustrate the results of different discretization methods. It contains 150 instances with four continuous features and three class labels. We applied PiD with chunks of thirty instances. The resulting discretizations are described in Table III. The first part shows the cut-points obtained in the first layer. The other two parts show the intervals obtained by supervised and unsupervised discretization. The supervised method used in the second layer was recursive entropy discretization; the unsupervised method was equal frequency. This example shows how the limits of the intervals are flexible and change on data demand.

IV. EXPERIMENTAL EVALUATION

A. Methodology

All datasets are from the U. C. Irvine repository [2]. The accuracy of each classifier is computed as the percentage of successful predictions on the test set. Accuracy was evaluated using 10-fold cross validation. All algorithms generate a model on the same training sets, and all models were evaluated on the same test sets. The following pseudo-algorithms illustrate the differences between the incremental and batch modes (a code sketch of the incremental protocol is given below):

Evaluation in batch mode:
1- Discretize the training data.
2- Discretize the test set using the intervals obtained in step 1.
3- Learn a model using the discretized training data.
4- Evaluate the model on the discretized test data.

Evaluation in incremental mode:
1- For each chunk of training examples E
2-   Discretize E using PiD: update the current set of intervals used by PiD and obtain the discretized set of examples Ediscretized.
3-   Update the incremental learning model with Ediscretized.
4- Next chunk
5- Discretize the test set using the current set of intervals of PiD.
6- Evaluate the current learning model on the discretized test set.

We have evaluated two situations to initialize the first layer: one without seeing any data, using equal width discretization, and another after seeing some data, using equal frequency. Due to lack of space we only present the results for equal frequency. Since this is a parametric method, to find the number of cut-points we used the heuristic number of intervals = √l / 4, where l is the number of instances in the first chunk of data. The minimum number of partitions for the first layer was set to 10.
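The incremental evaluation protocol and the first-layer heuristic just described can be summarized in a short sketch. The learner interface (update/error) and the reuse of the PiDSketch class from the earlier sketch are hypothetical; this is illustrative code under our assumptions, not the evaluation harness used in the paper.

```python
import numpy as np

def evaluate_incremental(chunks, test_x, test_y, learner, lo, hi):
    """Sketch of the incremental evaluation protocol for a single continuous
    attribute (one PiD per attribute in general): discretize each chunk with
    PiD, update the learner, then evaluate on the test set discretized with
    the final intervals. `learner` is any object with update(X, y) and
    error(X, y) methods (a hypothetical interface)."""
    l = len(chunks[0][0])                                 # instances in the first chunk
    n1 = max(10, int(np.sqrt(l) / 4))                     # heuristic sqrt(l)/4, at least 10 partitions
    pid = PiDSketch(lo, hi, n1=n1, n2=3)                  # PiDSketch from the earlier sketch

    for X, y in chunks:                                   # single pass over the training stream
        pid.update(X)                                     # first layer: update the counters
        X_disc = pid.discretize(X)                        # second layer: current intervals
        learner.update(X_disc, y)                         # incremental model update

    return learner.error(pid.discretize(test_x), test_y)  # evaluate with the final intervals
```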
We carried out experiments with supervised and unsupervised discretization in the second layer. The minimum number of intervals was set to two. The supervised discretization method chosen was recursive entropy discretization; the method is described above, and more information can be found in [6]. In the initialization phase the stopping criterion used was the minimum description length. The initial discretization defines the number of intervals of the second layer (in the experiments reported here, the number of intervals is defined by the first chunk of data and never changes when new data becomes available; this is a restriction imposed by the learning algorithm we use, which, like most learning algorithms, requires that the number of possible values of a discrete attribute is known and fixed in advance). When more data is available the intervals can grow or shrink. The unsupervised discretization method used was equal frequency. Since this method is parametric, to find the number of intervals we used the heuristic number of intervals = √l / 4, where l is the number of instances in the first chunk of data. In the datasets used in this paper this choice works well. In all the experiments the learning algorithms used are a batch TAN and its incremental version TANi.

B. Comparison between supervised and unsupervised incremental discretization

Table IV presents the results of batch, initial and incremental discretization for supervised and unsupervised discretization. The first three columns present the results for unsupervised discretization and the last three the results for supervised discretization. Supervised discretization has an advantage over unsupervised discretization in both the batch and incremental versions. For comparative purposes we also present the results of a discretization using only the first chunk of data. It is interesting to observe that, in this case, unsupervised discretization has some advantages over supervised discretization. This may be due to the fact that supervised discretization needs more examples than its unsupervised version to perform a good discretization.

C. Comparison between batch and incremental discretizations

In Table IV we can analyze the difference in performance between the batch discretization, the initial discretization (with only 10% of the data) and the incremental discretization. Table I presents a summary of the significance of the differences using the Wilcoxon test. We can see the importance of the discretization. It is clear that a batch discretization outperforms an initial discretization. But we can also see that with an incremental discretization it is possible to improve on the results of an initial discretization. The average error over all datasets shows that the incremental discretization improves the accuracy of the same algorithm using the initial discretization. From these results we can conclude that batch discretization is an optimistic method to evaluate incremental discretization; PiD is able to approach batch discretization. In these datasets, most differences are not significant.

TABLE I
SUMMARY OF RESULTS OF THE WILCOXON TEST WITH SIGNIFICANCE LEVEL OF 5% (WINS / EQUALS / LOSSES OF THE INCREMENTAL DISCRETIZATION)

                          Supervised                  Unsupervised
Incremental vs.     wins   equals   loses       wins   equals   loses
batch                 0      14       4           2      12       4
initial               3      14       1           4      12       2

D. Varying the number of examples per chunk

Table II presents the results of TANi with incremental discretization for different chunk sizes. The size of the chunks varies from 10% to 1% of the data. We would like to emphasize that building a model and a discretization with 1% of the data is extremely difficult. Even under these rough conditions, the results are close to the batch algorithm. The accuracy improves when the number of instances per chunk grows, an expected result. Nevertheless, as pointed out before, PiD depends on the initial discretization; when the first chunk of instances grows, it constitutes a better base. A set of experiments, not reported here due to lack of space, indicated that PiD seems to be resilient to the order in which the examples are presented.
TABLE II
EXPERIMENTAL RESULTS OF BATCH ALGORITHMS AND PID WITH DIFFERENT SIZES OF CHUNKS (ACCURACY %, SUPERVISED DISCRETIZATION)

Data set          TAN           TANi          TANi          TANi          TANi          TANi
Discretization    Batch         Batch         Incremental   Incremental   Incremental   Incremental
Chunks            -             10%           10%           5%            2%            1%
Australian        87.10±4.12    86.23±4.11    85.80±4.47    85.65±2.69    85.36±4.85    82.17±8.14
Pima              76.44±4.25    75.26±4.09    74.98±6.38    72.52±3.45    74.08±4.36    73.18±5.84
Diabetes          75.01±2.65    75.13±3.23    73.83±3.86    73.70±4.58    73.44±4.87    73.57±3.57
Tokyo             91.55±2.43    92.28±1.50    91.34±2.16    92.39±2.31    88.11±4.09    89.05±3.87
German            70.70±3.59    71.50±3.81    71.88±4.98    72.10±3.75    71.40±3.60    72.90±4.86
Segmentation      94.98±1.45    94.68±1.37    93.90±1.56    92.38±2.69    91.39±2.01    91.73±1.36
Waveform          80.68±1.41    80.80±2.02    80.04±1.34    77.92±1.99    77.40±1.51    77.42±1.42
Churn             89.02±1.92    87.04±1.44    86.84±1.73    86.02±1.21    86.36±1.49    86.28±1.23
Satellite Image   87.65±1.30    87.89±1.24    85.86±1.67    86.22±0.96    85.77±1.43    84.04±1.53
Adult             83.37±0.56    85.66±0.54    84.72±0.42    84.29±0.49    84.31±0.52    84.23±0.57
Shuttle           99.93±0.04    99.86±0.07    99.78±0.08    99.77±0.07    99.69±0.07    99.63±0.12
Average           85.07         85.05         84.45         83.91         83.39         83.11

TABLE III
RESULTS OF SUPERVISED AND UNSUPERVISED DISCRETIZATION WITH THE IRIS DATA SET

First layer with equal frequency discretization:
Feature   Cut points                                                                Num. points
Att1      4.95; 5.15; 5.35; 5.45; 5.55; 5.65; 5.75; 5.86; 5.95; 6.25; 6.40; 6.80     12
Att2      2.35; 2.45; 2.60; 2.75; 2.95; 3.05; 3.15; 3.25; 3.35; 3.45; 3.60; 3.75     12
Att3      1.35; 1.45; 1.55; 2.60; 3.80; 4.10; 4.30; 4.55; 4.80; 5.05; 5.45; 5.65     12
Att4      0.25; 0.75; 1.05; 1.15; 1.20; 1.25; 1.35; 1.45; 1.65; 1.75; 1.95; 2.15     12

Second layer with equal frequency discretization (unsupervised):
Feature   Cut points after 30 / 60 / 90 / 120 / 150 instances                        Num. intervals
Att1      5.55 / 5.65 / 5.65 / 5.75 / 5.75                                           2
Att2      2.95 / 2.95 / 2.95 / 2.95 / 2.95                                           2
Att3      4.30 / 4.10 / 4.10 / 4.10 / 4.30                                           2
Att4      1.25 / 1.25 / 1.25 / 1.25 / 1.25                                           2

Second layer with recursive entropy discretization (supervised):
Feature   Cut points after 30 / 60 / 90 / 120 / 150 instances                        Num. intervals
Att1      5.45 / 5.55 / 5.86 / 5.45 / 5.55                                           2
Att2      2.75 / 3.25 / 3.25 / 3.25 / 3.05                                           2
Att3      2.60;5.05 / 2.60;5.05 / 2.60;5.05 / 2.60;4.80 / 2.60;4.80                  3
Att4      0.75;1.65 / 0.75;1.65 / 0.75;1.65 / 0.75;1.75 / 0.75;1.75                  3

E. A real-world Application

In the TrawlHauls dataset (Fig. 3) the goal is to predict, from GPS information, when a trawl vessel is fishing or not [1]. The attributes are derived from a window of 11 consecutive points (five before the decision point and five after): eleven attributes are the direction and the other eleven are the speed of the vessel. This is a two-class problem defined by 22 attributes. From all the available data, 675,329 examples, we randomly selected 10% for testing, and the remainder (607,886 examples) were used for training. Using this training and test set, the errors of the three PiD discretization methods with a naive Bayes classifier are: entropy 6.86%, equal width 6.91%, and equal frequency 6.93%. For comparative purposes, the error of the updatable naive Bayes in WEKA [10] is 9.11% (other variants, including the discretization procedures implemented in Weka, did not run due to lack of memory).

Fig. 3. Illustrative figure for the 'trawl haul' problem. The bottom panel shows the speed of the vessel and the predicted trawl hauls: the blue lines show the predictions and the black lines show the trawl hauls identified by the user.
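As an illustration of how an incremental learner can consume PiD-discretized data: the paper uses TANi and an updatable naive Bayes, whereas the fragment below is only a simplified naive Bayes sketch with hypothetical names and Laplace smoothing, not the classifiers used in the experiments. It keeps per-class counts per interval and can be updated one chunk at a time.

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Naive Bayes over discretized attributes, updated one chunk at a time."""

    def __init__(self, n_intervals):
        self.n_intervals = n_intervals        # intervals per attribute (fixed by PiD's second layer)
        self.class_counts = defaultdict(int)  # N(c)
        self.counts = defaultdict(int)        # N(attribute, interval, c)

    def update(self, X_disc, y):
        """X_disc: rows of interval indices; y: class labels."""
        for row, c in zip(X_disc, y):
            self.class_counts[c] += 1
            for a, interval in enumerate(row):
                self.counts[(a, interval, c)] += 1

    def predict(self, row):
        """Pick the class maximizing the Laplace-smoothed product of counts."""
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            score = nc / total
            for a, interval in enumerate(row):
                score *= (self.counts[(a, interval, c)] + 1) / (nc + self.n_intervals)
            if score > best_score:
                best, best_score = c, score
        return best
```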
V. CONCLUSION AND FUTURE WORK

In this paper we present a new method for incremental discretization. Although incremental learning is a hot topic in machine learning and discretization is a fundamental pre-processing step for some well-known algorithms, the topic of incremental discretization has received little attention from the community. We have introduced a new discretization method that works in two layers. This two-stage architecture is very flexible: it can be used in supervised or unsupervised mode, and any base discretization method can be used for the second layer. The most relevant aspect is that the boundaries of the intervals of the second layer can change when new data is available. We also note that this method can save memory and computing time. Since discretization simplifies and reduces data storage, having a first layer with equal frequency as preparation for a second layer with more complex methods, such as recursive entropy discretization or chi-merge, can be faster than applying those methods to the whole dataset. The main advantage of the proposed method is the ability to process training examples in a single scan over the data. PiD processes examples in constant time and space, even for infinite sequences of streaming data.

ACKNOWLEDGEMENTS

This work was developed under project Adaptive Learning Systems II (POSI/EIA/55340/2004).

REFERENCES

[1] M. Afonso-Dias, J. Simoes, and C. Pinto. A dedicated GIS to estimate and map fishing effort and landings for the Portuguese crustacean trawl fleet. In T. Nishida, P. J. Kaiola, and C. E. Hollingworth, editors, GIS/Spatial Analyses in Fishery and Aquatic Sciences, volume 2, pages 323-340. Fishery-Aquatic GIS Research Group, Saitama, Japan, 2004.
[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.
[3] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.
[4] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Proceedings of the 12th International Conference on Machine Learning, pages 194-202. Morgan Kaufmann, 1995.
[5] T. Elomaa and J. Rousu. Necessary and sufficient pre-processing in numerical range discretization. Knowledge and Information Systems, 5:162-182, 2003.
[6] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1027. Morgan Kaufmann, 1993.
[7] N. Friedman and M. Goldszmidt. Building classifiers using Bayesian networks. In AAAI/IAAI, Vol. 2, pages 1277-1284, 1996.
[8] R. Giraldez, J. S. Aguilar-Ruiz, J. C. Riquelme, F. J. Ferrer-Troyano, and D. S. Rodriguez-Baena. Discretization oriented to decision rule generation. In 6th International Conference on Knowledge-Based Intelligent Information Engineering Systems, pages 275-279. IOS Press, 2002.
[9] S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In STOC '01: Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 471-475. ACM Press, 2001.
[10] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
[11] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, pages 1137-1145. Morgan Kaufmann, 1995.
[12] R. Kohavi and M. Sahami. Error-based and entropy-based discretization of continuous features. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 114-119, 1996.
[13] M. J. Pazzani. An iterative improvement approach for the discretization of numeric attributes in Bayesian classifiers. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 1995.
[14] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[15] J. Roure. Incremental Methods for Bayesian Network Structure Learning. PhD thesis, Universidad Politécnica de Cataluña, 2004.
[16] Y. Yang. Discretization for Naive-Bayes Learning. PhD thesis, School of Computer Science and Software Engineering, Monash University, July 2003.

TABLE IV
EXPERIMENTAL RESULTS OF BATCH ALGORITHMS AND PID WITH SUPERVISED AND UNSUPERVISED INCREMENTAL DISCRETIZATION METHODS
(18 UCI datasets, each characterized by number of instances, continuous and discrete attributes, and class values. For each dataset the table reports the accuracy of TANi under unsupervised and supervised discretization, in batch, initial and incremental (chunks of 10%) modes. Average accuracy over all datasets: unsupervised - batch 84.36, initial 83.09, incremental 83.29; supervised - batch 85.14, initial 82.81, incremental 84.37.)