Adaptive Sampling Methods for
Scaling up Knowledge Discovery
Algorithms
From Ch. 8 of Instance Selection and Construction for Data Mining (2001)
By Carlos Domingo et al., Kluwer Academic Publishers
(Summarized by Jinsan Yang, SNU Biointelligence Lab)
• Abstract
  - Methods for handling large amounts of data
  - An adaptive sampling method instead of random sampling
• Keywords
  - Data Mining, Knowledge Discovery, Scalability, Adaptive Sampling, Concentration Bounds
Outline
• Introduction
• General Rule Selection Problem
• Adaptive Sampling Algorithm
• An Application of Adaselect
  - Problem and Algorithm
  - Experiments
• Concluding Remarks
Introduction (1)
• Analysis of large data: either redesign a known algorithm or reduce the data size
• A typical task in data mining: finding or selecting some rules or laws (General Rule Selection)
  - General Rule Selection by random sampling (batch sampling)
  - Proper sample size determined by concentration or deviation bounds (Chernoff, Hoeffding bounds)
• Problems
  - An immense sample size is needed for good accuracy and confidence
  - For batch sampling, the sample size must be fixed a priori for the worst case, so it is overestimated in most situations (a sketch follows below)
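To make the overestimation concrete, here is a minimal Python sketch (not from the chapter) of how a worst-case batch sample size follows from the Hoeffding bound combined with a union bound over the hypothesis set; the function name and the example values of epsilon, delta, and the hypothesis count are illustrative only.

```python
import math

def hoeffding_batch_size(epsilon: float, delta: float, n_hypotheses: int = 1) -> int:
    """Worst-case sample size so that every empirical utility in [0, 1] is
    within epsilon of its true value with probability >= 1 - delta
    (Hoeffding bound plus a union bound over the hypothesis set)."""
    return math.ceil(math.log(2 * n_hypotheses / delta) / (2 * epsilon ** 2))

# Illustrative targets only: the required batch size grows quickly.
print(hoeffding_batch_size(epsilon=0.01, delta=0.05))                      # -> 18445
print(hoeffding_batch_size(epsilon=0.01, delta=0.05, n_hypotheses=10**4))  # -> 64497
```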
Introduction (2)
• Overcoming these problems
  - Sample in an online, sequential fashion (one by one or block by block)
  - Adapt the sample size as sampling proceeds (adaptive sampling)
General Rule Selection Problem
• Given data D (discrete, categorical?) and a model set H,
  select a model h with maximum value of the utility U(h)
  (supervised learning) — a batch-sampling sketch follows below
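A minimal sketch of the batch-sampling baseline for this problem, assuming U(h) can be estimated on a random sample; `utility` is a hypothetical callable standing in for the chapter's utility function, and all names are illustrative.

```python
import random

def batch_rule_selection(data, hypotheses, utility, sample_size, rng=random):
    """Batch (static) sampling for general rule selection: draw one random
    sample of a size fixed in advance, estimate each hypothesis's utility on
    it, and return the hypothesis with the highest estimate."""
    sample = rng.sample(data, min(sample_size, len(data)))
    return max(hypotheses, key=lambda h: utility(h, sample))
```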
Adaptive Sampling Algorithm (1)
• Extension of the Hoeffding bound to bounded utilities (0 ≤ U(h) ≤ c_d)
• Reliability of the algorithm (a sequential-sampling sketch follows below)
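A minimal sketch of a sequential sampling loop with a Hoeffding-style stopping rule, in the spirit of AdaSelect but not the chapter's exact algorithm or bound; `score(h, x)` is assumed to be a per-example utility in [0, 1] whose mean is U(h), and the confidence-radius formula and stopping condition are illustrative.

```python
import math

def adaptive_rule_selection(example_stream, hypotheses, score, epsilon, delta,
                            max_n=1_000_000):
    """Sequential (online) sampling with a Hoeffding-style stopping rule.
    Examples are drawn one by one; running utility estimates are kept for
    every hypothesis, and sampling stops as soon as the shrinking confidence
    radius separates the empirical leader (or max_n is reached).
    Assumes at least two candidate hypotheses."""
    totals = {h: 0.0 for h in hypotheses}
    n = 0
    for x in example_stream:
        n += 1
        for h in hypotheses:
            totals[h] += score(h, x)              # per-example utility in [0, 1]
        # Hoeffding radius with union bounds over hypotheses and over time steps.
        radius = math.sqrt(math.log(2 * len(hypotheses) * n * (n + 1) / delta)
                           / (2 * n))
        ranked = sorted(hypotheses, key=lambda h: totals[h], reverse=True)
        lead = (totals[ranked[0]] - totals[ranked[1]]) / n
        # If the empirical leader is far enough ahead, its true utility is
        # within epsilon of the best with probability >= 1 - delta.
        if lead >= 2 * radius - epsilon or n >= max_n:
            return ranked[0], n                   # chosen rule and sample size used
```

The sample size is decided by the data itself: easy selection problems stop early, and only hard ones approach the worst-case batch size.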
Adaptive Sampling Algorithm (2)
An Application of Adaselect (1)
• Can be applied as a tool for the general rule selection problem
• Example chosen: a boosting-based classification algorithm that uses a simple decision stump learner as the base learner
  - Decision stump: a single-split decision tree (a sketch follows below)
  - AdaBoost performs boosting by sub-sampling or re-weighting; here adaptive sampling is applied to the base learner (boosting by filtering)
  - MadaBoost is used, which keeps each weight bounded by its initial value
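A minimal sketch of a decision stump learner over discrete attributes, matching the "single-split decision tree" description above; the MadaBoost weighting/filtering of examples is omitted, and all names are illustrative.

```python
from collections import defaultdict

def train_decision_stump(examples, n_attributes):
    """Decision stump = single-split decision tree over discrete attributes.
    Each example is (attribute_tuple, label) with labels in {0, 1}.  The stump
    splits on the attribute whose per-value majority vote makes the fewest
    errors on the given (possibly sampled) examples."""
    best = None                                   # (errors, attribute, value -> label)
    for a in range(n_attributes):
        counts = defaultdict(lambda: [0, 0])      # label counts per attribute value
        for x, y in examples:
            counts[x[a]][y] += 1
        errors = sum(min(c) for c in counts.values())
        if best is None or errors < best[0]:
            rule = {v: (0 if c[0] >= c[1] else 1) for v, c in counts.items()}
            best = (errors, a, rule)
    _, a, rule = best
    return lambda x: rule.get(x[a], 0)            # unseen values default to class 0
```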
An Application of Adaselect (2)
• Algorithm
  - Data: discrete instance vectors with labels
  - Classification rule: decision stump
  - Utility function U based on the 0-1 error measure (a sketch follows below)
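A minimal sketch of the per-example utility under the 0-1 error measure, assuming U(h) here is the expected classification accuracy of the stump; this is the kind of `score(h, x)` the sequential sampling sketch above would consume, and the names are illustrative.

```python
def zero_one_score(stump, example):
    """Per-example utility under the 0-1 error measure: 1 if the decision
    stump classifies the example correctly, 0 otherwise.  Its average over
    sampled examples estimates the stump's accuracy, i.e. the utility U(h)."""
    x, y = example
    return 1.0 if stump(x) == y else 0.0
```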
An Application of Adaselect (3)
• Experiments
  - Discretize numeric attributes into 5 intervals and treat missing values as another value (a sketch follows after this list)
  - Artificial inflation (100 copies) of the original UCI data
  - 2-class problems only
  - 10-fold cross validation; results averaged over 10 runs
  - Machine: Alpha CPU at 600 MHz, 250 MB memory, 4.3 GB hard disk, under Linux
  - C4.5 and Naïve Bayes classifiers for comparison
  - Boosting rounds: 10
  - Number of all possible decision stumps: ||H_DS||
  - (The final hypothesis is a weighted majority of ten depth-1 decision trees)
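A minimal sketch of the discretization step, assuming equal-width intervals (the summary does not say which discretization scheme was used); missing values become an extra sixth category.

```python
def discretize_column(values, n_bins=5, missing=None):
    """Discretize one numeric attribute into n_bins intervals (equal-width
    here, as an assumption) and map missing values to an extra category
    with index n_bins."""
    observed = [v for v in values if v != missing]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins or 1.0             # avoid zero width on constant columns
    out = []
    for v in values:
        if v == missing:
            out.append(n_bins)                    # missing value -> its own category
        else:
            out.append(min(int((v - lo) / width), n_bins - 1))
    return out
```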
An Application of Adaselect (4)
An Application of Adaselect (5)
Adaselect is faster than C4.5, especially for large sample sizes.
Concluding Remarks
• Justification and efficiency analysis
• Applied in the design of a base learner for a boosting algorithm