國立雲林科技大學 National Yunlin University of Science and Technology
Intelligent Database Systems Lab, N.Y.U.S.T. I. M.

An Evaluation of Progressive Sampling for Imbalanced Data Sets
Authors: Willie Ng and Manoranjan Dash
2006 IEEE International Conference on Data Mining - Workshops
Advisor: Dr. Hsu
Presenter: Ai-Chen Liao

Outline
- Motivation
- Objective
- Method
  - Progressive Sampling (PS)
  - Progressive Sampling with Over-sampling (PSOS)
- Experimental Results
- Conclusion
- Comments

Motivation
One of the emerging challenges for the data mining research community is to enable learning algorithms to mine huge databases. Even if a large data set fits into memory, running a learning algorithm on the entire data set can be computationally expensive. One way of abating the cost is to train the learning algorithm on samples of the data.

Objective
We study the learning-curve sampling method, an approach for applying machine learning algorithms to massive data sets. We present a refinement of progressive sampling that works well in practice and converges to the desired sample size quickly and accurately.

Method - Progressive Sampling (PS)
Provost et al. [1] suggested using progressive sampling (PS) on a large data set. PS starts with a reasonably small sample and uses progressively larger ones until the accuracy of the learning algorithm no longer improves.

[Figure: learning curve - model performance vs. sample size]

Progressive sampling requires the definition of at least three main components: (i) the sampling schedule, (ii) the initial data sample, and (iii) the termination criterion. The schedule may be arithmetic (for instance 100, 200, 300, 400, ...) or geometric (for instance 100, 200, 400, 800, ...).

Method - Progressive Sampling with Over-sampling (PSOS)
The notion of modifying PS is motivated by the experiments documented in [13].
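The PS loop described above can be sketched as follows. Here `evaluate` is a hypothetical stand-in for training a classifier on the sample and measuring its accuracy on a held-out set; the starting size `n0`, growth `factor`, and tolerance `tol` are illustrative assumptions, not values from the paper.

```python
import random

def progressive_sampling(data, labels, evaluate, n0=100, factor=2, tol=0.005):
    """Progressive sampling (PS) sketch: evaluate the learner on geometrically
    growing random samples, stopping once accuracy gains fall below `tol`."""
    prev_acc = float("-inf")
    best_n = n0
    n = n0
    while n <= len(data):
        idx = random.sample(range(len(data)), n)
        acc = evaluate([data[i] for i in idx], [labels[i] for i in idx])
        if acc - prev_acc < tol:   # termination: the learning curve has flattened
            return best_n, prev_acc
        prev_acc, best_n = acc, n
        n *= factor                # geometric schedule: 100, 200, 400, 800, ...
    return best_n, prev_acc       # exhausted the data without converging
```

For example, with a simulated learning curve such as `evaluate = lambda X, y: 1 - 50 / (len(X) + 50)`, the loop keeps doubling the sample until one more doubling improves accuracy by less than `tol`, then returns the last sample size that still gave a worthwhile gain.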
To achieve good classification, the optimal class distribution of the training set generally contains between 50% and 90% minority-class examples.

Experimental Results
[Figures: experimental results]

Conclusion
In this paper, we study the efficiency of PS when applied to imbalanced data sets. In PSOS, we place emphasis on maintaining a balanced training set so as to speed up convergence as well as improve overall accuracy. [PSOS converges on average about 2 iterations earlier than CPS.]

Comments
- Advantage: …
- Drawback: …
- Application: handling imbalanced data
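A minimal sketch of the balancing step PSOS adds to each sampling iteration, assuming random duplication of minority examples up to a target minority fraction (the 50% default follows the 50-90% guideline above; the function and parameter names are hypothetical, not from the paper):

```python
import random

def oversample_minority(X, y, minority_label=1, target_frac=0.5, seed=0):
    """Randomly duplicate minority-class examples until they make up
    `target_frac` of the training set (a sketch of the over-sampling idea)."""
    rng = random.Random(seed)
    minority = [(x, l) for x, l in zip(X, y) if l == minority_label]
    majority = [(x, l) for x, l in zip(X, y) if l != minority_label]
    # Solve target_frac = m / (m + len(majority)) for the required minority count m.
    need = int(target_frac * len(majority) / (1 - target_frac))
    extra = [rng.choice(minority) for _ in range(max(0, need - len(minority)))]
    combined = minority + majority + extra
    rng.shuffle(combined)
    return [x for x, _ in combined], [l for _, l in combined]
```

Duplication is the simplest over-sampling scheme; it leaves the majority class untouched, so the sample only grows, and the same balanced set can then be fed to the learner at each PS iteration.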