Maximizing Classifier Utility when Training Data is Costly
Gary M. Weiss and Ye Tian, Fordham University
UBDM 2006 Workshop, August 20, 2006

Outline
- Introduction: motivation, cost model
- Experimental Methodology
- Results: Adult data set
- Progressive Sampling
- Related Work
- Future Work / Conclusion

Motivation
- Utility-Based Data Mining is concerned with the utility of the overall data mining process.
- A key cost is the cost of the training data. These costs are often ignored (except in active learning).
- We are the first to analyze the impact of a very simple cost model; in doing so we fill a hole in the existing research.
- Our cost model:
  - A fixed cost for acquiring each labeled training example; no separate cost for class labels, missing features, etc. Turney [1] called this the "cost of cases."
  - No control over which training examples are chosen, i.e., no active learning.

Motivation (cont.)
- Efficient progressive sampling [2] determines the "optimal" training set size, where optimal is the point at which the learning curve reaches a plateau.
- That approach assumes data acquisition costs are essentially zero. What if the acquisition costs are significant?

Motivating Examples
- Predicting customer behavior/buying potential: training data is purchased from D&B, Ziff-Davis, and other "information vendors" that make money by selling information.
- Poker playing: learn about an opponent by playing him.

Experiments
- Use C4.5 to determine the relationship between accuracy and training set size.
- Random sampling is used to reduce the training set size, following a predetermined sampling schedule.
- For this talk we focus on the Adult data set (~21,000 examples).
- 20 runs are used to increase the reliability of the results.
- CPU times are recorded, mainly for future work.

Measuring Total Utility
- Total cost = data cost + error cost = n * Ctr + e * |S| * Cerr, where
  n = number of training examples, e = error rate, |S| = number of examples in the score set,
  Ctr = cost of a training example, Cerr = cost of an error.
- We will know n and e for any experiment. With domain knowledge we could estimate Ctr, Cerr, and |S|, but we do not have this knowledge, so we treat Ctr and Cerr as parameters and vary them.
- Assume |S| = 100 with no loss of generality (if |S| is 100,000, then look at the results for Cerr/1,000).
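As a minimal sketch of this cost model (the function name and the illustrative numbers below are hypothetical, not taken from the experiments), the total-utility calculation can be written as:

```python
def total_cost(n, error_rate, c_tr, c_err, score_set_size=100):
    """Total cost = data cost + error cost = n*Ctr + e*|S|*Cerr."""
    return n * c_tr + error_rate * score_set_size * c_err

# Hypothetical example: 5,000 training examples, 15% error rate,
# cost ratio Ctr:Cerr = 1:1000 -> 5,000 + 0.15*100*1000 = 20,000
print(total_cost(n=5000, error_rate=0.15, c_tr=1, c_err=1000))
```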
Measuring Total Utility (cont.)
- Now we only look at the cost ratio Ctr:Cerr; typical values evaluated are 1:1, 1:1000, etc.
- The relative cost ratio is Cerr/Ctr.
- Example: if the cost ratio is 1:1000, then it is an even trade-off if buying 1,000 training examples eliminates one error. Alternatively, buying 1,000 examples is worth a 1% reduction in error rate (and then we can ignore |S| = 100).

Learning Curve
[Figure: learning curve for the Adult data set, accuracy (%) vs. training set size from 0 to 15,000. There is no plateau; accuracy is still changing by 0.3% at the largest training set sizes.]

Utility Curves
[Figure: total cost vs. training set size (0 to 16,000) for cost ratios 10:1, 1:1, 1:1000, 1:3000, 1:5000, and 1:7500.]

Utility Curves (Normalized Cost)
[Figure: normalized cost vs. training set size (0 to 16,000) for cost ratios 1:10, 1:100, 1:500, 1:1000, 1:5000, and 1:50,000.]

Optimal Training Set Size Curve
[Figure: optimal training set size vs. relative cost (0 to 40,000) for the Adult data set; the accuracy achieved is shown near each data point, rising from 84.8% at roughly 3,000 examples to 85.9% at roughly 15,000 examples.]

Value of the Optimal Curve
- Even without specific cost information, this chart could be useful for a practitioner: it can put bounds on the appropriate training set size.
- It is analogous to Drummond and Holte's cost curves [3]: they looked at the cost ratio of false positives to false negatives, whereas we look at the cost of errors vs. the cost of data.
- Both types of curves allow the practitioner to understand the impact of the various costs.

Idealized Learning Curve
[Figure, two panels: (top) an idealized learning curve, accuracy = n / (n + 1), for training set sizes 0 to 5,000; (bottom) the resulting optimal training set size, optimal n = 10 * sqrt(relative cost) - 1, plotted against relative cost (in millions), with the optimal size in thousands.]
- For this idealized curve the error rate is 1/(n + 1), so with |S| = 100 the total cost is n * Ctr + 100 * Cerr / (n + 1); minimizing over n gives the closed form optimal n = 10 * sqrt(Cerr/Ctr) - 1.

Progressive Sampling
- We want to find the optimal training set size, which means determining when to stop acquiring data before acquiring all of it!
- Strategy: use a progressive sampling strategy.
- Key issues: When do we stop? What sampling schedule should we use?

Our Progressive Sampling Strategy
- We stop after the first increase in total cost. The result is therefore never optimal, but it is near-optimal if the learning curve is non-decreasing. (A sketch of this stopping rule follows.)
- We evaluate two simple sampling schedules:
  - S1: 10, 50, 100, 500, 1000, 2000, ..., 9000, 10,000, 12,000, 14,000, ...
  - S2: 50, 100, 200, 400, 800, 1600, ...
  - S1 and S2 are similar for modest-sized data sets.
- An adaptive strategy could also be used.
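Below is a minimal sketch of the stopping rule just described: grow the training set along a schedule and stop after the first increase in total cost. The learner here is simulated with the idealized curve accuracy = n/(n+1) from the earlier slide; in practice each step would train C4.5 (or any classifier) on the n purchased examples and measure its error on a held-out set. The function names and the example cost ratio are illustrative assumptions, not from the talk.

```python
import math

def simulated_error(n):
    # Stand-in for a real learner: the idealized curve accuracy = n/(n+1),
    # so the error rate is 1/(n+1). In practice, train on the n purchased
    # examples and measure the error rate on held-out data.
    return 1.0 / (n + 1)

def total_cost(n, error_rate, c_tr, c_err, score_set_size=100):
    # Total cost = data cost + error cost = n*Ctr + e*|S|*Cerr
    return n * c_tr + error_rate * score_set_size * c_err

def progressive_sampling(schedule, c_tr, c_err):
    """Buy data in batches along the schedule; stop after the first
    increase in total cost and return the best size seen so far."""
    best_n, best_cost = None, math.inf
    for n in schedule:
        cost = total_cost(n, simulated_error(n), c_tr, c_err)
        if cost > best_cost:   # first increase in total cost -> stop
            break
        best_n, best_cost = n, cost
    return best_n, best_cost

# Schedule S2 from the talk: 50, 100, 200, 400, 800, 1600, ...
s2 = [50 * 2 ** i for i in range(12)]
# Relative cost 1,000 (cost ratio 1:1000); for the idealized curve the
# analytic optimum is 10*sqrt(1000) - 1, roughly 315 examples.
print(progressive_sampling(s2, c_tr=1, c_err=1000))   # -> (400, ~649)
```

With a non-decreasing learning curve the returned size is near-optimal; a finer-grained schedule lands closer to the true optimum at the price of more CPU time.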
Adult Data Set: S1 vs. Straw Man
[Figure: total cost (in units of 100K) for the S1 strategy vs. a straw-man strategy on the Adult data set, across cost ratios 1:1, 1:20, 1:500, 1:1000, 1:5000, and 1:10000.]

Progressive Sampling Conclusions
- We can use progressive sampling to determine a near-optimal training set size.
- Its effectiveness depends mainly on how well behaved (i.e., non-decreasing) the learning curve is.
- The sampling schedule/batch size is also important: finer granularity requires more CPU time, but if data is costly, CPU time is most likely less expensive. In our experiments the cumulative CPU time was under one minute.

Related Work
- Efficient progressive sampling [2] tries to efficiently find the asymptote of the learning curve. That work treats the data cost as essentially zero and stops only when added data has no benefit.
- Active learning is similar in that data cost is factored in, but the setting is different: the user has control over which examples are selected or which features are measured. It does not address the simple "cost of cases" scenario.
- Finding the best class distribution when training data are costly [4]: assumes the training set size is limited but pre-specified, and finds the class distribution that maximizes performance.

Limitations/Future Work
- Improvements: bigger data sets where the learning curve plateaus, more sophisticated sampling schemes, and better behaved learning curves.
- Incorporate cost-sensitive learning (cost of a false positive != cost of a false negative).
- Include CPU time in the utility metric.
- Analyze other cost models.
- Study the learning curves themselves.
- Real-world motivating examples, perhaps with cost information.

Conclusion
- We analyze the impact of the cost of training data on the classification process.
- We introduce new ways of visualizing the impact of data cost: utility curves and optimal training set size curves.
- We show that progressive sampling can be used to help learn a near-optimal classifier.

We Want Feedback
- We are continuing this work. Clearly many minor enhancements are possible; feel free to suggest more.
- Any major new directions or extensions? What, if anything, is most interesting?
- Any really good motivating examples that you are familiar with?

Questions?
If I have run out of time, please find me during the break!

References
1. P. Turney (2000). Types of Cost in Inductive Concept Learning. Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning.
2. F. Provost, D. Jensen & T. Oates (1999). Efficient Progressive Sampling. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
3. C. Drummond & R. Holte (2000). Explicitly Representing Expected Cost: An Alternative to ROC Representation. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 198-207.
4. G. Weiss & F. Provost (2003). Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19:315-354.
Learning Curves for Large Data Sets
[Figure: learning curves, accuracy (%) vs. training set size (0 to 15,000), for the large data sets adult, network1, blackjack, coding, and boa1.]

Optimal Curves for Large Data Sets
[Figure: optimal training set size vs. relative cost (0 to 20,000) for coding, blackjack, network1, and boa1.]

Learning Curves for Small Data Sets
[Figure: learning curves, accuracy (%) vs. training set size (0 to 2,500), for the small data sets kr-vs-kp, breast-wisc, crx, german, and move.]

Optimal Curves for Small Data Sets
[Figure: optimal training set size vs. relative cost (0 to 3,500) for move, kr-vs-kp, german, breast-wisc, and crx.]

Results for Adult Data Set
(training set size, total cost, and CPU time for the optimal result and for the S1 and S2 schedules, by relative cost ratio)

Relative Cost Ratio | Optimal: Size / Cost / CPU   | S1: Size / Cost / CPU     | S2: Size / Cost / CPU
1                   | 10 / 34 / 0.00               | 50 / 74 / 0.00            | 100 / 122 / 0.00
10                  | 10 / 25 / 0.00               | 50 / 292 / 0.00           | 100 / 319 / 0.00
20                  | 500 / 2,233 / 0.20           | 50 / 2,470 / 0.00         | 100 / 538 / 0.00
200                 | 500 / 3,966 / 0.20           | 1,000 / 4,266 / 0.53      | 800 / 4,060 / 0.40
500                 | 500 / 9,165 / 0.20           | 2,000 / 9,945 / 1.23      | 1,600 / 9,480 / 0.92
5,000               | 5,000 / 79,450 / 4.17        | 6,000 / 79,800 / 5.27     | 12,800 / 83,700 / 14.84
10,000              | 9,000 / 152,900 / 9.15       | 7,000 / 154,700 / 6.48    | 12,800 / 154,600 / 14.84
15,000              | 9,000 / 224,850 / 9.15       | 7,000 / 228,550 / 6.48    | 15,960 / 226,860 / 20.88
20,000              | 9,000 / 296,800 / 9.15       | 7,000 / 302,400 / 6.48    | 15,960 / 297,160 / 20.88
50,000              | 15,960 / 721,460 / 20.89     | 7,000 / 745,500 / 6.48    | 15,960 / 718,960 / 20.88

Optimal vs. S1 for Large Data Sets
(increase in total cost of the S1 strategy relative to the optimal, by relative cost ratio)

Relative Cost Ratio | Adult  | Blackjack | Boa1  | Coding | Network1
1                   | 115.7% | 53.2%     | 70.1% | 62.8%  | 91.0%
20                  | 10.6%  | 34.6%     | 5.1%  | 2.0%   | 0.7%
500                 | 8.5%   | 1.0%      | 1.2%  | 2.1%   | 2.7%
1,000               | 3.2%   | 2.6%      | 2.3%  | 0.6%   | 3.6%
5,000               | 0.4%   | 1.4%      | 4.7%  | 0.2%   | 1.5%
10,000              | 1.2%   | 1.1%      | 5.9%  | 0.0%   | 1.3%
15,000              | 1.6%   | 1.6%      | 6.3%  | 0.0%   | 1.2%
20,000              | 1.9%   | 1.9%      | 6.5%  | 0.0%   | 1.1%
50,000              | 3.3%   | 0.7%      | 6.9%  | 0.0%   | 1.0%