Maximizing Classifier Utility
when Training Data is Costly
Gary M. Weiss
Ye Tian
Fordham University
Outline
- Introduction
  - Motivation, cost model
- Experimental Methodology
  - Adult data set
- Results
- Progressive Sampling
- Related Work
- Future Work/Conclusion
Motivation
- Utility-Based Data Mining
  - Concerned with the utility of the overall data mining process
  - A key cost is the cost of training data
    - These costs are often ignored (except in active learning)
- We are the first to analyze the impact of a very simple cost model
  - In doing so we fill a hole in the existing research
- Our cost model:
  - A fixed cost for acquiring labeled training examples
    - No separate cost for class labels, missing features, etc.
    - Turney [1] called this the "cost of cases"
  - No control over which training examples are chosen
    - No active learning
Motivation (cont.)
- Efficient progressive sampling [2]
  - Determines the "optimal" training set size
  - Optimal is where the learning curve reaches a plateau
  - Assumes data acquisition costs are essentially zero
- What if the acquisition costs are significant?
Motivating Examples
- Predicting customer behavior/buying potential
  - Training data from D&B and Ziff-Davis
  - These and other "information vendors" make money by selling information
- Poker playing
  - Learn about an opponent by playing him
Experiments
- Use C4.5 to determine the relationship between accuracy and training set size (see the sketch below)
  - Random sampling to reduce the training set size
  - 20 runs used to increase the reliability of results
- For this talk we focus on the adult data set (~21,000 examples)
- We utilize a predetermined sampling schedule
- CPU times recorded, mainly for future work
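A minimal sketch of this procedure, under two stated assumptions: scikit-learn's DecisionTreeClassifier stands in for C4.5 (which has no standard Python implementation), and the adult data set is fetched from OpenML (which hosts ~48,000 examples; the talk used ~21,000):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = fetch_openml(name="adult", version=2, as_frame=True)
X = pd.get_dummies(data.data)   # one-hot encode the categorical features
y = data.target

# Hold out a fixed test set; training sets are sampled from the remaining pool.
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

schedule = [10, 50, 100, 500, 1000, 2000, 4000, 8000, 16000]  # predetermined schedule
n_runs = 20  # repeated runs to increase reliability, as in the talk

for n in schedule:
    accs = []
    for run in range(n_runs):
        # Random sampling to reduce the training set to size n.
        idx = np.random.RandomState(run).choice(len(X_pool), size=n, replace=False)
        clf = DecisionTreeClassifier(random_state=run)
        clf.fit(X_pool.iloc[idx], y_pool.iloc[idx])
        accs.append(clf.score(X_test, y_test))
    print(f"n = {n:6d}   mean accuracy = {np.mean(accs):.3f}")
```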
Measuring Total Utility
- Total cost = Data cost + Error cost = n·Ctr + e·|S|·Cerr (transcribed into code below), where:
  - n = number of training examples
  - e = error rate
  - |S| = number of examples in the score set
  - Ctr = cost of a training example
  - Cerr = cost of an error
- We will know n and e for any experiment
- With domain knowledge one could estimate Ctr, Cerr, and |S|, but we don't have this knowledge
  - Treat Ctr and Cerr as parameters and vary them
  - Assume |S| = 100 with no loss of generality
    - If |S| is really 100,000, then our results for a given Cerr correspond to a true error cost of Cerr/1,000
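For concreteness, the slide's cost formula as a small Python helper; the parameter names mirror the slide's symbols, and the example values are illustrative, not measurements:

```python
# Total cost = data cost + error cost = n*Ctr + e*|S|*Cerr.
def total_cost(n, e, c_tr, c_err, score_set_size=100):
    """Cost of acquiring n training examples plus the cost of the
    errors made when scoring score_set_size examples at error rate e."""
    return n * c_tr + e * score_set_size * c_err

# Illustrative values: 1,000 training examples at unit cost, a 14% error
# rate, and cost ratio 1:1000 (Ctr = 1, Cerr = 1000).
print(total_cost(n=1000, e=0.14, c_tr=1, c_err=1000))  # 1000 + 14000 = 15000
```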
Measuring Total Utility (cont.)
- Now we look only at the cost ratio, Ctr:Cerr
  - Typical values evaluated: 1:1, 1:1000, etc.
  - The relative cost ratio is Cerr/Ctr
- Example
  - If the cost ratio is 1:1000, then it is an even trade-off if buying 1,000 training examples eliminates 1 error
  - Alternatively: buying 1,000 examples is worth a 1% reduction in error rate (then we can ignore |S| = 100)
Learning Curve
[Figure: learning curve for the adult data set; accuracy (%) from 75 to 87 vs. training set size from 0 to 15,000. The curve rises steeply and then flattens, but reaches no plateau: accuracy still changes by 0.3% at the largest training set sizes.]
Utility Curves
[Figure: total cost (0 to 180,000) vs. training set size (0 to 16,000) for the adult data set, one utility curve per cost ratio: 10:1, 1:1, 1:1000, 1:3000, 1:5000, and 1:7500.]
Utility Curves (Normalized Cost)
[Figure: normalized cost (0% to 100%) vs. training set size (0 to 16,000), one curve per cost ratio: 1:10, 1:100, 1:500, 1:1000, 1:5000, and 1:50,000.]
Optimal Training Set Size Curve
[Figure: optimal training set size (0 to 15,000) vs. relative cost (0 to 40,000) for the adult data set. The accuracy achieved is shown near each data point, rising from 84.8% at the smallest optimal sizes through 85.1%, 85.4%, 85.6%, and 85.8% to 85.9% at the largest.]
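A sketch of how such a curve can be computed from an empirical learning curve; the (size, error rate) pairs below are hypothetical placeholders, not the talk's measurements:

```python
# Hypothetical learning curve: (training set size, error rate) pairs.
learning_curve = [(10, 0.24), (100, 0.20), (1000, 0.17), (3000, 0.152),
                  (6000, 0.148), (9000, 0.145), (15000, 0.141)]

def optimal_size(relative_cost, curve, score_set_size=100, c_tr=1.0):
    """Size minimizing n*Ctr + e*|S|*Cerr, where relative_cost = Cerr/Ctr."""
    c_err = relative_cost * c_tr
    return min(curve, key=lambda p: p[0] * c_tr + p[1] * score_set_size * c_err)[0]

for rc in [1, 500, 5000, 20000, 50000]:
    print(f"relative cost {rc:6d} -> optimal size {optimal_size(rc, learning_curve)}")
```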
Value of Optimal Curve
- Even without specific cost information, this chart could be useful for a practitioner
  - Can put bounds on the appropriate training set size
- Analogous to Drummond and Holte's cost curves [3]
  - They looked at the cost ratio of false positives vs. false negatives
  - We look at the cost ratio of errors vs. the cost of data
- Both types of curves allow the practitioner to understand the impact of the various costs
Idealized Learning Curve
[Figure, two panels. Top: an idealized learning curve, accuracy = n/(n+1), plotted against training set size n from 0 to 5,000. Bottom: the resulting optimal training set size, optimal n = 10·sqrt(RC) − 1 (in thousands), plotted against relative cost RC from 0 to 100 million.]
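The bottom-panel formula follows from the cost model of slide 8; a short derivation, assuming the talk's |S| = 100 and measuring cost in units of Ctr:

```latex
% The idealized curve gives accuracy(n) = n/(n+1), so the error rate is
% e(n) = 1 - n/(n+1) = 1/(n+1). Write RC = C_err / C_tr.
\[
  \frac{\mathrm{Cost}(n)}{C_{tr}} = n + \frac{100\,RC}{n+1},
  \qquad
  \frac{d}{dn}\!\left(n + \frac{100\,RC}{n+1}\right)
    = 1 - \frac{100\,RC}{(n+1)^{2}} = 0
\]
\[
  (n+1)^{2} = 100\,RC
  \quad\Longrightarrow\quad
  n^{*} = 10\sqrt{RC} - 1 .
\]
```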
Progressive Sampling
- We want to find the optimal training set size
  - Need to determine when to stop acquiring data, before acquiring all of it!
- Strategy: use a progressive sampling strategy
- Key issues:
  - When do we stop?
  - What sampling schedule should we use?
Our Progressive Sampling Strategy
- We stop after the first increase in total cost (see the sketch below)
  - The results are therefore never optimal, but they are near-optimal if the learning curve is non-decreasing
- We evaluate two simple sampling schedules
  - S1: 10, 50, 100, 500, 1000, 2000, ..., 9000, 10,000, 12,000, 14,000, ...
  - S2: 50, 100, 200, 400, 800, 1600, ...
  - S1 and S2 are similar given modest-sized data sets
- Could use an adaptive strategy
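A minimal sketch of the stopping rule, assuming two hypothetical helpers: acquire_examples(k) buys k labeled examples, and error_rate(train) trains the classifier (C4.5 in the talk) and returns its error rate. The schedules and stopping rule come from the slides; the code itself is illustrative:

```python
S1 = [10, 50, 100, 500] + list(range(1000, 10001, 1000)) + list(range(12000, 20001, 2000))
S2 = [50 * 2 ** k for k in range(10)]  # 50, 100, 200, ..., 25,600

def progressive_sample(schedule, c_tr, c_err, score_set_size=100):
    train, prev = [], None            # prev = (size, total cost) at last point
    for n in schedule:
        train.extend(acquire_examples(n - len(train)))  # grow to the next size
        e = error_rate(train)                           # hypothetical helper
        cost = n * c_tr + e * score_set_size * c_err    # slide-8 cost formula
        if prev is not None and cost > prev[1]:
            return prev               # first increase in total cost: stop here
        prev = (n, cost)
    return prev                       # schedule exhausted without an increase
```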
Adult Data Set: S1 vs. Straw Man
[Figure: total cost (in hundreds of thousands, 0 to 1.5) of the Straw Man strategy vs. the S1 strategy at cost ratios 1:1, 1:20, 1:500, 1:1000, 1:5000, and 1:10000.]
Progressive Sampling Conclusions
- We can use progressive sampling to determine a near-optimal training set size
  - Effectiveness depends mainly on how well behaved the learning curve is (i.e., non-decreasing)
  - The sampling schedule/batch size is also important
- Finer granularity requires more CPU time
  - But if data is costly, CPU time is most likely less expensive
  - In our experiments, cumulative CPU time < 1 minute
Related Work
- Efficient progressive sampling [2]
  - Tries to efficiently find the asymptote of the learning curve
  - That work has a data cost of ε: stop only when added data has no benefit
- Active learning
  - Similar in that data cost is factored in, but the setting is different
  - The user has control over which examples are selected or which features are measured
  - Does not address the simple "cost of cases" scenario
- Finding the best class distribution when training data is costly [4]
  - Assumes the training set size is limited, but the size is pre-specified
  - Finds the best class distribution to maximize performance
Limitations/Future Work
- Improvements:
  - Bigger data sets, where the learning curve plateaus
  - More sophisticated sampling schemes
  - Incorporate cost-sensitive learning (cost of FP ≠ cost of FN)
  - Generate better behaved learning curves
  - Include CPU time in the utility metric
  - Analyze other cost models
  - Study the learning curves
  - Real-world motivating examples
    - Perhaps with cost information
Conclusion
- We analyze the impact of training data cost on the classification process
- We introduce new ways of visualizing the impact of data cost
  - Utility curves
  - Optimal training set size curves
- We show that progressive sampling can be used to help learn a near-optimal classifier
We Want Feedback
- We are continuing this work
  - Clearly many minor enhancements are possible
  - Feel free to suggest more
- Any major new directions/extensions?
- What, if anything, is most interesting?
- Any really good motivating examples that you are familiar with?
Questions?
- If I have run out of time, please find me during the break!
References
1. P. Turney (2000). Types of Cost in Inductive Concept Learning. Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning.
2. F. Provost, D. Jensen & T. Oates (1999). Efficient Progressive Sampling. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
3. C. Drummond & R. Holte (2000). Explicitly Representing Expected Cost: An Alternative to ROC Representation. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 198-207.
4. G. Weiss & F. Provost (2003). Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19:315-354.
Learning Curves for Large Data Sets
[Figure: learning curves, accuracy (%) from 50 to 90 vs. training set size from 0 to 15,000, for five large data sets; from highest to lowest accuracy: adult, network1, blackjack, coding, boa1.]
Optimal Curves for Large Data Sets
[Figure: optimal training set size (0 to 15,000) vs. relative cost (0 to 20,000) for the coding, blackjack, network1, and boa1 data sets.]
Learning Curves for Small Data Sets
[Figure: learning curves, accuracy (%) from 50 to 100 vs. training set size from 0 to 2,500, for five small data sets; from highest to lowest accuracy: kr-vs-kp, breast-wisc, crx, german, move.]
Optimal Curves for Small Data Sets
[Figure: optimal training set size (0 to 2,500) vs. relative cost (0 to 3,500) for the move, kr-vs-kp, german, breast-wisc, and crx data sets.]
Results for Adult Data Set

Relative   |      Optimal-S1        |          S1            |          S2
Cost Ratio | Size     Cost     CPU  | Size    Cost      CPU  | Size     Cost      CPU
-----------+------------------------+------------------------+------------------------
     1     |     10        34  0.00 |    50        74   0.00 |    100       122   0.00
    10     |     10        25  0.00 |    50       292   0.00 |    100       319   0.00
    20     |    500     2,233  0.20 |    50     2,470   0.00 |    100       538   0.00
   200     |    500     3,966  0.20 | 1,000     4,266   0.53 |    800     4,060   0.40
   500     |    500     9,165  0.20 | 2,000     9,945   1.23 |  1,600     9,480   0.92
 5,000     |  5,000    79,450  4.17 | 6,000    79,800   5.27 | 12,800    83,700  14.84
10,000     |  9,000   152,900  9.15 | 7,000   154,700   6.48 | 12,800   154,600  14.84
15,000     |  9,000   224,850  9.15 | 7,000   228,550   6.48 | 15,960   226,860  20.88
20,000     |  9,000   296,800  9.15 | 7,000   302,400   6.48 | 15,960   297,160  20.88
50,000     | 15,960   721,460 20.89 | 7,000   745,500   6.48 | 15,960   718,960  20.88
Optimal vs. S1 for Large Data Sets

Increase in total cost, S1 vs. S1-optimal:

Relative   | Adult   Blackjack   Boa1   Coding  Network1
Cost Ratio |
-----------+-----------------------------------------------
     1     | 115.7%    53.2%    70.1%   62.8%    91.0%
    20     |  10.6%    34.6%     5.1%    2.0%     0.7%
   500     |   8.5%     1.0%     1.2%    2.1%     2.7%
 1,000     |   3.2%     2.6%     2.3%    0.6%     3.6%
 5,000     |   0.4%     1.4%     4.7%    0.2%     1.5%
10,000     |   1.2%     1.1%     5.9%    0.0%     1.3%
15,000     |   1.6%     1.6%     6.3%    0.0%     1.2%
20,000     |   1.9%     1.9%     6.5%    0.0%     1.1%
50,000     |   3.3%     0.7%     6.9%    0.0%     1.0%