Maximizing Classifier Utility when Training Data is Costly
Gary M. Weiss and Ye Tian, Fordham University
UBDM 2006 Workshop, August 20, 2006

Outline
- Introduction: motivation, cost model
- Experimental Methodology
- Results: Adult data set
- Progressive Sampling
- Related Work
- Future Work / Conclusion

Motivation
- Utility-Based Data Mining is concerned with the utility of the overall data mining process.
- A key cost is the cost of the training data. These costs are often ignored (except in active learning).
- We are the first to analyze the impact of a very simple cost model; in doing so we fill a hole in the existing research.
- Our cost model:
  - A fixed cost for acquiring each labeled training example; no separate cost for class labels, missing features, etc. Turney [1] called this the "cost of cases."
  - No control over which training examples are chosen, i.e., no active learning.

Motivation (cont.)
- Efficient progressive sampling [2] determines the "optimal" training set size, where optimal is the point at which the learning curve reaches a plateau.
- That approach assumes data acquisition costs are essentially zero. What if the acquisition costs are significant?

Motivating Examples
- Predicting customer behavior/buying potential: training data is purchased from D&B, Ziff-Davis, and other "information vendors" that make money by selling information.
- Poker playing: learn about an opponent by playing him.

Experiments
- Use C4.5 to determine the relationship between accuracy and training set size.
- Random sampling is used to reduce the training set size, following a predetermined sampling schedule.
- For this talk we focus on the Adult data set (~21,000 examples).
- 20 runs are used to increase the reliability of the results.
- CPU times are recorded, mainly for future work.

Measuring Total Utility
- Total cost = data cost + error cost = n * Ctr + e * |S| * Cerr, where
  n = number of training examples, e = error rate, |S| = number of examples in the score set,
  Ctr = cost of a training example, Cerr = cost of an error.
- We will know n and e for any experiment. With domain knowledge we could estimate Ctr, Cerr, and |S|, but we do not have this knowledge, so we treat Ctr and Cerr as parameters and vary them.
- Assume |S| = 100 with no loss of generality (if |S| is 100,000, then look at the results for Cerr/1,000).
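As a minimal sketch of this cost model (the function name and the illustrative numbers below are hypothetical, not taken from the experiments), the total-utility calculation can be written as:

```python
def total_cost(n, error_rate, c_tr, c_err, score_set_size=100):
    """Total cost = data cost + error cost = n*Ctr + e*|S|*Cerr."""
    return n * c_tr + error_rate * score_set_size * c_err

# Hypothetical example: 5,000 training examples, 15% error rate,
# cost ratio Ctr:Cerr = 1:1000 -> 5,000 + 0.15*100*1000 = 20,000
print(total_cost(n=5000, error_rate=0.15, c_tr=1, c_err=1000))
```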
Measuring Total Utility (cont.)
- Now we only look at the cost ratio Ctr:Cerr; typical values evaluated are 1:1, 1:1000, etc.
- The relative cost ratio is Cerr/Ctr.
- Example: if the cost ratio is 1:1000, then it is an even trade-off if buying 1,000 training examples eliminates one error. Alternatively, buying 1,000 examples is worth a 1% reduction in error rate (and then we can ignore |S| = 100).

Learning Curve
[Figure: learning curve for the Adult data set, accuracy (%) vs. training set size from 0 to 15,000. There is no plateau; accuracy is still changing by 0.3% at the largest training set sizes.]

Utility Curves
[Figure: total cost vs. training set size (0 to 16,000) for cost ratios 10:1, 1:1, 1:1000, 1:3000, 1:5000, and 1:7500.]

Utility Curves (Normalized Cost)
[Figure: normalized cost vs. training set size (0 to 16,000) for cost ratios 1:10, 1:100, 1:500, 1:1000, 1:5000, and 1:50,000.]

Optimal Training Set Size Curve
[Figure: optimal training set size vs. relative cost (0 to 40,000) for the Adult data set; the accuracy achieved is shown near each data point, rising from 84.8% at roughly 3,000 examples to 85.9% at roughly 15,000 examples.]

Value of the Optimal Curve
- Even without specific cost information, this chart could be useful for a practitioner: it can put bounds on the appropriate training set size.
- It is analogous to Drummond and Holte's cost curves [3]: they looked at the cost ratio of false positives to false negatives, whereas we look at the cost of errors vs. the cost of data.
- Both types of curves allow the practitioner to understand the impact of the various costs.

Idealized Learning Curve
[Figure, two panels: (top) an idealized learning curve, accuracy = n / (n + 1), for training set sizes 0 to 5,000; (bottom) the resulting optimal training set size, optimal n = 10 * sqrt(relative cost) - 1, plotted against relative cost (in millions), with the optimal size in thousands.]
- For this idealized curve the error rate is 1/(n + 1), so with |S| = 100 the total cost is n * Ctr + 100 * Cerr / (n + 1); minimizing over n gives the closed form optimal n = 10 * sqrt(Cerr/Ctr) - 1.

Progressive Sampling
- We want to find the optimal training set size, which means determining when to stop acquiring data before acquiring all of it!
- Strategy: use a progressive sampling strategy.
- Key issues: When do we stop? What sampling schedule should we use?

Our Progressive Sampling Strategy
- We stop after the first increase in total cost. The result is therefore never optimal, but it is near-optimal if the learning curve is non-decreasing. (A sketch of this stopping rule follows.)
- We evaluate two simple sampling schedules:
  - S1: 10, 50, 100, 500, 1000, 2000, ..., 9000, 10,000, 12,000, 14,000, ...
  - S2: 50, 100, 200, 400, 800, 1600, ...
  - S1 and S2 are similar for modest-sized data sets.
- An adaptive strategy could also be used.
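Below is a minimal sketch of the stopping rule just described: grow the training set along a schedule and stop after the first increase in total cost. The learner here is simulated with the idealized curve accuracy = n/(n+1) from the earlier slide; in practice each step would train C4.5 (or any classifier) on the n purchased examples and measure its error on a held-out set. The function names and the example cost ratio are illustrative assumptions, not from the talk.

```python
import math

def simulated_error(n):
    # Stand-in for a real learner: the idealized curve accuracy = n/(n+1),
    # so the error rate is 1/(n+1). In practice, train on the n purchased
    # examples and measure the error rate on held-out data.
    return 1.0 / (n + 1)

def total_cost(n, error_rate, c_tr, c_err, score_set_size=100):
    # Total cost = data cost + error cost = n*Ctr + e*|S|*Cerr
    return n * c_tr + error_rate * score_set_size * c_err

def progressive_sampling(schedule, c_tr, c_err):
    """Buy data in batches along the schedule; stop after the first
    increase in total cost and return the best size seen so far."""
    best_n, best_cost = None, math.inf
    for n in schedule:
        cost = total_cost(n, simulated_error(n), c_tr, c_err)
        if cost > best_cost:   # first increase in total cost -> stop
            break
        best_n, best_cost = n, cost
    return best_n, best_cost

# Schedule S2 from the talk: 50, 100, 200, 400, 800, 1600, ...
s2 = [50 * 2 ** i for i in range(12)]
# Relative cost 1,000 (cost ratio 1:1000); for the idealized curve the
# analytic optimum is 10*sqrt(1000) - 1, roughly 315 examples.
print(progressive_sampling(s2, c_tr=1, c_err=1000))   # -> (400, ~649)
```

With a non-decreasing learning curve the returned size is near-optimal; a finer-grained schedule lands closer to the true optimum at the price of more CPU time.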
Adult Data Set: S1 vs. Straw Man
[Figure: total cost (in units of 100K) for the S1 strategy vs. a straw-man strategy on the Adult data set, across cost ratios 1:1, 1:20, 1:500, 1:1000, 1:5000, and 1:10000.]

Progressive Sampling Conclusions
- We can use progressive sampling to determine a near-optimal training set size.
- Its effectiveness depends mainly on how well behaved (i.e., non-decreasing) the learning curve is.
- The sampling schedule/batch size is also important: finer granularity requires more CPU time, but if data is costly, CPU time is most likely less expensive. In our experiments the cumulative CPU time was under one minute.

Related Work
- Efficient progressive sampling [2] tries to efficiently find the asymptote of the learning curve. That work treats the data cost as essentially zero and stops only when added data has no benefit.
- Active learning is similar in that data cost is factored in, but the setting is different: the user has control over which examples are selected or which features are measured. It does not address the simple "cost of cases" scenario.
- Finding the best class distribution when training data are costly [4]: assumes the training set size is limited but pre-specified, and finds the class distribution that maximizes performance.

Limitations/Future Work
- Improvements: bigger data sets where the learning curve plateaus, more sophisticated sampling schemes, and better behaved learning curves.
- Incorporate cost-sensitive learning (cost of a false positive != cost of a false negative).
- Include CPU time in the utility metric.
- Analyze other cost models.
- Study the learning curves themselves.
- Real-world motivating examples, perhaps with cost information.

Conclusion
- We analyze the impact of the cost of training data on the classification process.
- We introduce new ways of visualizing the impact of data cost: utility curves and optimal training set size curves.
- We show that progressive sampling can be used to help learn a near-optimal classifier.

We Want Feedback
- We are continuing this work. Clearly many minor enhancements are possible; feel free to suggest more.
- Any major new directions or extensions? What, if anything, is most interesting?
- Any really good motivating examples that you are familiar with?

Questions?
If I have run out of time, please find me during the break!

References
1. P. Turney (2000). Types of Cost in Inductive Concept Learning. Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning.
2. F. Provost, D. Jensen & T. Oates (1999). Efficient Progressive Sampling. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
3. C. Drummond & R. Holte (2000). Explicitly Representing Expected Cost: An Alternative to ROC Representation. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 198-207.
4. G. Weiss & F. Provost (2003). Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19:315-354.
Learning Curves for Large Data Sets
[Figure: learning curves, accuracy (%) vs. training set size (0 to 15,000), for the large data sets adult, network1, blackjack, coding, and boa1.]

Optimal Curves for Large Data Sets
[Figure: optimal training set size vs. relative cost (0 to 20,000) for coding, blackjack, network1, and boa1.]

Learning Curves for Small Data Sets
[Figure: learning curves, accuracy (%) vs. training set size (0 to 2,500), for the small data sets kr-vs-kp, breast-wisc, crx, german, and move.]

Optimal Curves for Small Data Sets
[Figure: optimal training set size vs. relative cost (0 to 3,500) for move, kr-vs-kp, german, breast-wisc, and crx.]

Results for Adult Data Set
(training set size, total cost, and CPU time for the optimal result and for the S1 and S2 schedules, by relative cost ratio)

Relative Cost Ratio | Optimal: Size / Cost / CPU   | S1: Size / Cost / CPU     | S2: Size / Cost / CPU
1                   | 10 / 34 / 0.00               | 50 / 74 / 0.00            | 100 / 122 / 0.00
10                  | 10 / 25 / 0.00               | 50 / 292 / 0.00           | 100 / 319 / 0.00
20                  | 500 / 2,233 / 0.20           | 50 / 2,470 / 0.00         | 100 / 538 / 0.00
200                 | 500 / 3,966 / 0.20           | 1,000 / 4,266 / 0.53      | 800 / 4,060 / 0.40
500                 | 500 / 9,165 / 0.20           | 2,000 / 9,945 / 1.23      | 1,600 / 9,480 / 0.92
5,000               | 5,000 / 79,450 / 4.17        | 6,000 / 79,800 / 5.27     | 12,800 / 83,700 / 14.84
10,000              | 9,000 / 152,900 / 9.15       | 7,000 / 154,700 / 6.48    | 12,800 / 154,600 / 14.84
15,000              | 9,000 / 224,850 / 9.15       | 7,000 / 228,550 / 6.48    | 15,960 / 226,860 / 20.88
20,000              | 9,000 / 296,800 / 9.15       | 7,000 / 302,400 / 6.48    | 15,960 / 297,160 / 20.88
50,000              | 15,960 / 721,460 / 20.89     | 7,000 / 745,500 / 6.48    | 15,960 / 718,960 / 20.88

Optimal vs. S1 for Large Data Sets
(increase in total cost of the S1 strategy relative to the optimal, by relative cost ratio)

Relative Cost Ratio | Adult  | Blackjack | Boa1  | Coding | Network1
1                   | 115.7% | 53.2%     | 70.1% | 62.8%  | 91.0%
20                  | 10.6%  | 34.6%     | 5.1%  | 2.0%   | 0.7%
500                 | 8.5%   | 1.0%      | 1.2%  | 2.1%   | 2.7%
1,000               | 3.2%   | 2.6%      | 2.3%  | 0.6%   | 3.6%
5,000               | 0.4%   | 1.4%      | 4.7%  | 0.2%   | 1.5%
10,000              | 1.2%   | 1.1%      | 5.9%  | 0.0%   | 1.3%
15,000              | 1.6%   | 1.6%      | 6.3%  | 0.0%   | 1.2%
20,000              | 1.9%   | 1.9%      | 6.5%  | 0.0%   | 1.1%
50,000              | 3.3%   | 0.7%      | 6.9%  | 0.0%   | 1.0%