Adaptive Sampling Methods for
Scaling up Knowledge Discovery
Algorithms
From Ch. 8 of Instance Selection and Construction for Data Mining (2001)
By Carlos Domingo et al., Kluwer Academic Publishers
(Summarized by Jinsan Yang, SNU Biointelligence Lab)
• Abstract
  - Methods for handling large amounts of data
  - An adaptive sampling method instead of random sampling
• Keywords
  - Data Mining, Knowledge Discovery, Scalability, Adaptive Sampling, Concentration Bounds
Outline
• Introduction
• General Rule Selection Problem
• Adaptive Sampling Algorithm
• An Application of Adaselect
  - Problem and Algorithm
  - Experiments
• Concluding Remarks
Introduction (1)
• Analysis of large data: either redesign a known algorithm or reduce the data size
• A typical task in data mining: finding or selecting some rules or laws (General Rule Selection)
  - General Rule Selection by random sampling (batch sampling)
  - Proper sample size determined by concentration or deviation bounds (Chernoff, Hoeffding bounds)
• Problems
  - An immense sample size is needed for good accuracy and confidence
  - For batch sampling, the sample size must be fixed a priori for the worst case, so it is overestimated in most situations (a sketch follows below)
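To make the overestimation concrete, here is a minimal Python sketch (not from the chapter) of how a worst-case batch sample size follows from the Hoeffding bound combined with a union bound over the hypothesis set; the function name and the example values of epsilon, delta, and the hypothesis count are illustrative only.

```python
import math

def hoeffding_batch_size(epsilon: float, delta: float, n_hypotheses: int = 1) -> int:
    """Worst-case sample size so that every empirical utility in [0, 1] is
    within epsilon of its true value with probability >= 1 - delta
    (Hoeffding bound plus a union bound over the hypothesis set)."""
    return math.ceil(math.log(2 * n_hypotheses / delta) / (2 * epsilon ** 2))

# Illustrative targets only: the required batch size grows quickly.
print(hoeffding_batch_size(epsilon=0.01, delta=0.05))                      # -> 18445
print(hoeffding_batch_size(epsilon=0.01, delta=0.05, n_hypotheses=10**4))  # -> 64497
```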
Introduction (2)
• Overcoming these problems
  - Sample in an online, sequential fashion (one by one or block by block)
  - Adapt the sample size as sampling proceeds (adaptive sampling)
General Rule Selection Problem
• Given data D (discrete, categorical?) and a model set H,
  select a model h with maximum value of the utility U(h)
  (supervised learning) — a batch-sampling sketch follows below
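A minimal sketch of the batch-sampling baseline for this problem, assuming U(h) can be estimated on a random sample; `utility` is a hypothetical callable standing in for the chapter's utility function, and all names are illustrative.

```python
import random

def batch_rule_selection(data, hypotheses, utility, sample_size, rng=random):
    """Batch (static) sampling for general rule selection: draw one random
    sample of a size fixed in advance, estimate each hypothesis's utility on
    it, and return the hypothesis with the highest estimate."""
    sample = rng.sample(data, min(sample_size, len(data)))
    return max(hypotheses, key=lambda h: utility(h, sample))
```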
Adaptive Sampling Algorithm (1)
• Extension of the Hoeffding bound to bounded utilities (0 ≤ U(h) ≤ c_d)
• Reliability of the algorithm (a sequential-sampling sketch follows below)
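A minimal sketch of a sequential sampling loop with a Hoeffding-style stopping rule, in the spirit of AdaSelect but not the chapter's exact algorithm or bound; `score(h, x)` is assumed to be a per-example utility in [0, 1] whose mean is U(h), and the confidence-radius formula and stopping condition are illustrative.

```python
import math

def adaptive_rule_selection(example_stream, hypotheses, score, epsilon, delta,
                            max_n=1_000_000):
    """Sequential (online) sampling with a Hoeffding-style stopping rule.
    Examples are drawn one by one; running utility estimates are kept for
    every hypothesis, and sampling stops as soon as the shrinking confidence
    radius separates the empirical leader (or max_n is reached).
    Assumes at least two candidate hypotheses."""
    totals = {h: 0.0 for h in hypotheses}
    n = 0
    for x in example_stream:
        n += 1
        for h in hypotheses:
            totals[h] += score(h, x)              # per-example utility in [0, 1]
        # Hoeffding radius with union bounds over hypotheses and over time steps.
        radius = math.sqrt(math.log(2 * len(hypotheses) * n * (n + 1) / delta)
                           / (2 * n))
        ranked = sorted(hypotheses, key=lambda h: totals[h], reverse=True)
        lead = (totals[ranked[0]] - totals[ranked[1]]) / n
        # If the empirical leader is far enough ahead, its true utility is
        # within epsilon of the best with probability >= 1 - delta.
        if lead >= 2 * radius - epsilon or n >= max_n:
            return ranked[0], n                   # chosen rule and sample size used
```

The sample size is decided by the data itself: easy selection problems stop early, and only hard ones approach the worst-case batch size.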
Adaptive Sampling Algorithm (2)
An Application of Adaselect (1)
• Can be applied as a tool for the general rule selection problem
• Example chosen: a boosting-based classification algorithm that uses a simple decision stump learner as the base learner
  - Decision stump: a single-split decision tree (a sketch follows below)
  - AdaBoost performs boosting by sub-sampling or re-weighting; here adaptive sampling is applied to the base learner (boosting by filtering)
  - MadaBoost is used, which keeps each weight bounded by its initial value
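A minimal sketch of a decision stump learner over discrete attributes, matching the "single-split decision tree" description above; the MadaBoost weighting/filtering of examples is omitted, and all names are illustrative.

```python
from collections import defaultdict

def train_decision_stump(examples, n_attributes):
    """Decision stump = single-split decision tree over discrete attributes.
    Each example is (attribute_tuple, label) with labels in {0, 1}.  The stump
    splits on the attribute whose per-value majority vote makes the fewest
    errors on the given (possibly sampled) examples."""
    best = None                                   # (errors, attribute, value -> label)
    for a in range(n_attributes):
        counts = defaultdict(lambda: [0, 0])      # label counts per attribute value
        for x, y in examples:
            counts[x[a]][y] += 1
        errors = sum(min(c) for c in counts.values())
        if best is None or errors < best[0]:
            rule = {v: (0 if c[0] >= c[1] else 1) for v, c in counts.items()}
            best = (errors, a, rule)
    _, a, rule = best
    return lambda x: rule.get(x[a], 0)            # unseen values default to class 0
```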
An Application of Adaselect (2)
• Algorithm
  - Data: discrete instance vectors with labels
  - Classification rule: decision stump
  - Utility function U based on the 0-1 error measure (a sketch follows below)
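A minimal sketch of the per-example utility under the 0-1 error measure, assuming U(h) here is the expected classification accuracy of the stump; this is the kind of `score(h, x)` the sequential sampling sketch above would consume, and the names are illustrative.

```python
def zero_one_score(stump, example):
    """Per-example utility under the 0-1 error measure: 1 if the decision
    stump classifies the example correctly, 0 otherwise.  Its average over
    sampled examples estimates the stump's accuracy, i.e. the utility U(h)."""
    x, y = example
    return 1.0 if stump(x) == y else 0.0
```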
An Application of Adaselect (3)
• Experiments
  - Discretize numeric attributes into 5 intervals and treat missing values as another value (a sketch follows after this list)
  - Artificial inflation (100 copies) of the original UCI data
  - 2-class problems only
  - 10-fold cross validation; results averaged over 10 runs
  - Machine: Alpha CPU at 600 MHz, 250 MB memory, 4.3 GB hard disk, under Linux
  - C4.5 and Naïve Bayes classifiers for comparison
  - Boosting rounds: 10
  - Number of all possible decision stumps: ||H_DS||
  - (The final hypothesis is a weighted majority of ten depth-1 decision trees)
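A minimal sketch of the discretization step, assuming equal-width intervals (the summary does not say which discretization scheme was used); missing values become an extra sixth category.

```python
def discretize_column(values, n_bins=5, missing=None):
    """Discretize one numeric attribute into n_bins intervals (equal-width
    here, as an assumption) and map missing values to an extra category
    with index n_bins."""
    observed = [v for v in values if v != missing]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins or 1.0             # avoid zero width on constant columns
    out = []
    for v in values:
        if v == missing:
            out.append(n_bins)                    # missing value -> its own category
        else:
            out.append(min(int((v - lo) / width), n_bins - 1))
    return out
```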
An Application of Adaselect (4)
An Application of Adaselect (5)
Adaselect is faster than C4.5, especially for large sample sizes.
Concluding Remarks
• Justification and efficiency analysis
• Applied in the design of a base learner for a boosting algorithm