Incremental Classification Using Tree-Based Sampling for Large Data
H. Yoon, K. Alsabti, and S. Ranka
Instance Selection and Construction for Data Mining, Chapter 11, pp. 189-206
Presented by Cho, Dong-Yeon, (C) 2001 SNU CSE Biointelligence Lab

Introduction
* Large-scale applications
  - Building classifiers on large datasets requires new techniques that minimize the overall execution time while keeping comparable quality.
  - Constructing a classifier for a large, growing dataset from scratch would be extremely wasteful and computationally prohibitive.
* ICE (Incremental Classification for Ever-growing large datasets)
  - Incrementally builds decision tree classifiers for large, growing datasets using tree-based sampling techniques.

Decision Tree Classification
* Easily interpreted and comprehended by human beings, and constructed relatively fast
* Requires no prior knowledge of domains or distributions of the data, yet gives good classification accuracy
* Consists of a construction phase and a pruning phase

Related Work
* Test criteria for partitioning a set of records: ID3, C4.5; CART, SPRINT; CLOUDS
* Discretization and sampling for large datasets: windowing, stratification
* RainForest framework
* BOAT algorithm

Incremental Classification
* Problem definition
  - Terms: a decision tree T, a dataset D of size N, a new incremental dataset d of size n, and a new decision tree T'
  - The goal is to build T' with minimal cost such that T' is comparably as accurate as the decision tree built for D+d from scratch.

Impurity-Based Split Selection Methods
* The gini index
    gini(S) = 1 - sum_{i=1}^{m} (C_i / n)^2
  - n: the total number of records in a set S
  - C_i: the number of records that belong to class i
  - m: the number of classes
* When S is split into two subsets, S1 and S2, with n1 and n2 records:
    gini_split(S) = (n1/n) * gini(S1) + (n2/n) * gini(S2)
* The gini index does not lend itself to reuse when the number of records changes.

Weighted Gini Index
* A weight can be assigned to a record when the record represents a number of other records.
* Motivation: replace similar records by a weighted representative and use the representatives in incremental batch learning.
* Let S' be a set of n numerical points sorted in nondecreasing order; each point (called a w-point) is associated with a weight k (k >= 1).
* The total weight sum W of S':
    W = sum_{j=1}^{n} w_j
* The weighted gini index
    gini^w(S) = 1 - sum_{i=1}^{m} (Q_i / W)^2
    gini^w_split(S) = (W1/W) * gini^w(S1) + (W2/W) * gini^w(S2)
  - Q_i: the weight sum of the w-points that belong to class i in S'
  - W1 and W2: the weight sums of the subsets S'1 and S'2, respectively
* Lemma 1: The gini^w_split index value at a clear-cut splitting point x on a dimension is consistent with the original gini_split index value at the same splitting point.
* Lemma 2: The gini^w_split index value is consistent with the original gini_split index value with respect to the splitting condition.
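The split formulas above translate directly into code. The following is a minimal sketch (not from the chapter) that computes gini_split from raw per-class record counts and gini^w_split from per-class weight sums; when the weight sums of the representatives match the record counts on each side of a clear-cut split, the two values coincide, which is the consistency the lemmas state. NumPy and the example counts are illustrative assumptions.

```python
# Minimal sketch (not from the chapter) of the gini_split and gini^w_split
# computations defined above; the class counts used below are illustrative.
import numpy as np

def gini(class_counts):
    """gini(S) = 1 - sum_i (C_i / n)^2 for a vector of per-class record counts."""
    n = class_counts.sum()
    return 1.0 - np.sum((class_counts / n) ** 2)

def gini_split(counts_1, counts_2):
    """gini_split(S) = (n1/n) * gini(S1) + (n2/n) * gini(S2)."""
    n1, n2 = counts_1.sum(), counts_2.sum()
    n = n1 + n2
    return (n1 / n) * gini(counts_1) + (n2 / n) * gini(counts_2)

def gini_w(weight_sums):
    """gini^w(S) = 1 - sum_i (Q_i / W)^2 for a vector of per-class weight sums."""
    W = weight_sums.sum()
    return 1.0 - np.sum((weight_sums / W) ** 2)

def gini_w_split(weights_1, weights_2):
    """gini^w_split(S) = (W1/W) * gini^w(S1) + (W2/W) * gini^w(S2)."""
    W1, W2 = weights_1.sum(), weights_2.sum()
    W = W1 + W2
    return (W1 / W) * gini_w(weights_1) + (W2 / W) * gini_w(weights_2)

# Raw records: per-class counts on each side of a clear-cut splitting point.
print(gini_split(np.array([40.0, 10.0]), np.array([5.0, 45.0])))

# Weighted representatives whose per-class weight sums equal those counts
# produce the same split value, as stated by the lemmas.
print(gini_w_split(np.array([40.0, 10.0]), np.array([5.0, 45.0])))
```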
ICE: A Scalable Method for Incremental Classification
* Characteristics of ICE: size scalability, easy parallelization, inherent migration to distributed environments, temporal scalability, algorithm independence, minimal data access, and flexibility in handling addition and deletion of records as partitions

Sampling for Incremental Classification
* Sampling from a decision tree
  - One problem with random sampling is that it is sensitive to the distribution of the data.
  - The decision tree built on a dataset can be used to generate a good sample of that dataset.
  - Such a sampling method is expected to yield a new decision tree of better or comparable quality than random sampling does.

Sampling Techniques for Incremental Classification
* Random sampling
  - ICE-random: weighted samples are chosen randomly from each node of the decision tree built for the current incremental partition.
  - Stratified sampling: samples are randomly chosen based on the class labels, in proportion to the size of each class in each node.
* Local clustering (see the code sketch at the end)
  - ICE-local: a clustering algorithm is applied to the records in each node of an intermediate decision tree to obtain good weighted samples.
  - ICE-global: a clustering algorithm is applied to each incremental partition to extract samples for incremental classification.

Cost of Incremental Classification Using ICE
* The cost of ICE for the i-th incremental partition:
    T_ICE(A, D) = T(A, D_i) + T(A, U_i)
  - T(A, D_i): the cost of building a decision tree classifier for a dataset D_i with algorithm A
  - r: the sampling rate (0 < r <= 1)
  - U_i: the accumulated weighted samples from the partitions seen so far (roughly a fraction r of the records processed)
* The ratio R of the cost of ICE to the cost of building a decision tree from scratch:
    R = [ T(A, D_i) + T(A, U_i) ] / T(A, D)
  - R converges to r for large i under some assumptions.

Empirical Results
* Experimental results with real datasets
  - Records represent characteristics of hand-written alphabet letters.
  - 15,000 and 5,000 records of 16 attributes for the training and test datasets, respectively; 3 classes and 5 epochs
  - ICE-random is omitted on purpose: its results are not comparably better than random sampling from the entire dataset.
  - The k-means clustering algorithm was used for the clustering-based variants.
  - [Figure: Experimental results for the letter dataset with five epochs]
  - [Figure: The quality measure for the letter dataset at a 15% sampling rate]
* Experimental results with large synthetic datasets
  - 72 MB with 1 million records
  - A smaller dataset of 100,000 records was created and used to approximate the accuracy on the full dataset for comparison purposes.

Important Implications of ICE
* ICE can be applied to large datasets.
* ICE can run on any of various computing configurations.
* ICE can be utilized for dynamically growing large datasets.
* A larger sample size may improve the overall accuracy of ICE; however, the sampling rate need not be significantly high.
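To make the ICE-global idea from the sampling-techniques slide concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation: each incremental partition is clustered per class with k-means, each cluster centroid becomes a weighted representative whose weight is the number of records it stands for, and the classifier is rebuilt from the accumulated weighted samples U_i only. scikit-learn's KMeans and DecisionTreeClassifier, the synthetic data, the 15% rate, and the per-class clustering are all illustrative choices.

```python
# Minimal sketch of ICE-global-style incremental classification (illustrative,
# not the chapter's implementation). Assumes NumPy and scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def weighted_samples(X, y, rate=0.15):
    """Replace each class's records in a partition with k-means centroids;
    each centroid's weight is the number of records it represents (>= 1)."""
    reps, labels, weights = [], [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = max(1, int(rate * len(Xc)))               # number of representatives
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
        reps.append(km.cluster_centers_)
        labels.append(np.full(k, c))
        weights.append(np.bincount(km.labels_, minlength=k).astype(float))
    return np.vstack(reps), np.concatenate(labels), np.concatenate(weights)

# Accumulated weighted sample sets U_1, U_2, ... across incremental partitions.
U_X, U_y, U_w = [], [], []
rng = np.random.default_rng(0)
for epoch in range(5):                                # five incremental partitions
    X_i = rng.normal(size=(2000, 16))                 # illustrative 16-attribute data
    y_i = (X_i[:, 0] + X_i[:, 1] > 0).astype(int)     # illustrative 2-class labels
    Xs, ys, ws = weighted_samples(X_i, y_i, rate=0.15)
    U_X.append(Xs); U_y.append(ys); U_w.append(ws)

    # Rebuild the classifier from the accumulated weighted samples only.
    tree = DecisionTreeClassifier(random_state=0).fit(
        np.vstack(U_X), np.concatenate(U_y),
        sample_weight=np.concatenate(U_w))
```

With such per-step rebuilds, the work for each increment is T(A, D_i) for sampling the new partition plus T(A, U_i) for the small rebuild, matching the cost expression given above.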