Get Another Label?
Improving Data Quality and Data Mining Using Multiple, Noisy Labelers
Victor Sheng, Foster Provost, Panos Ipeirotis
New York University, Stern School
KDD 2008
Outsourcing preprocessing
• Traditionally, data mining teams have invested substantial
internal resources in data formulation, information extraction,
cleaning, and other preprocessing
– Raghu from his Innovation Lecture
“the best you can expect are noisy labels”
• Now, we can outsource preprocessing tasks, such as labeling,
feature extraction, verifying information extraction, etc.
– using Mechanical Turk, Rent-a-Coder, etc.
– quality may be lower than expert labeling (much?)
– but low costs can allow massive scale
• The ideas may also apply to focusing user-generated tagging, crowdsourcing, etc.
ESP Game (by Luis von Ahn)
Other “free” labeling schemes
• Open Mind initiative (http://www.openmind.org)
• Other GWAP games
– Tag a Tune
– Verbosity (tag words)
– Matchin (image ranking)
• Web 2.0 systems?
– Can/should tagging be directed?
Noisy labels can be problematic
Many tasks rely on high-quality labels for objects:
– learning predictive models
– searching for relevant information
– finding duplicate database records
– image recognition/labeling
– song categorization
• Noisy labels can lead to degraded performance
Quality and Classification Performance
• As labeling quality (labeling accuracy P) increases, classification quality increases
[Figure: classification accuracy vs. number of training examples (Mushroom data set), with one curve per labeling accuracy P = 1.0, 0.8, 0.6, 0.5]
Majority Voting and Label Quality
• Ask multiple “noisy” labelers, keep majority label as “true”
label
• Given 2N+1 labelers with uniform accuracy P (the probability of an individual labeler being correct), the integrated quality is $q = \Pr[\mathrm{Bin}(2N+1, P) \ge N+1]$, i.e., the probability that a majority of the labels is correct (a small computational sketch follows this slide)
[Figure: integrated quality vs. number of labelers (1, 3, 5, ..., 13), with one curve per individual labeler accuracy P = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
→ Possible directions: (1) remove noisy labelers, (2) collect as many labels as possible
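As an illustration (not from the original slides), here is a minimal Python sketch of the integrated-quality formula above; the function name and the example accuracies are my own choices.

import math

def integrated_quality(p, num_labelers):
    """Pr[Bin(2N+1, p) >= N+1]: probability that the majority vote of an odd
    number of independent labelers, each correct with probability p, is correct."""
    assert num_labelers % 2 == 1, "use an odd number of labelers (2N+1)"
    majority = num_labelers // 2 + 1
    return sum(math.comb(num_labelers, k) * p**k * (1 - p) ** (num_labelers - k)
               for k in range(majority, num_labelers + 1))

# Quality rises with more labelers when p > 0.5, and stays flat at 0.5 when p = 0.5.
for p in (0.5, 0.6, 0.8, 0.9):
    print(p, [round(integrated_quality(p, m), 3) for m in (1, 3, 5, 11)])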
Labeling Methods
• MV: majority voting
• Uncertainty preserving labeling (soft label)
– Multiplied Examples (ME): for each example x_i, ME considers the multiset of labels L_i collected for it and, for each label in L_i, creates a replica of x_i with that label and weight 1/|L_i| (a small sketch follows this slide)
• These replicas with weights are fed into the classifier for training
• Another method is to use Naive Bayes (see WSDM’11):
Modeling Annotator Accuracies for Supervised Learning, WSDM 2011
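A minimal sketch (my own, not the authors' code) of the Multiplied Examples idea; the names X and label_multisets are illustrative.

def multiplied_examples(X, label_multisets):
    """X: list of feature vectors; label_multisets[i]: list of labels collected for X[i].
    Returns replicated features, labels, and weights (summing to 1 per example)."""
    feats, labels, weights = [], [], []
    for x, multiset in zip(X, label_multisets):
        for label in multiset:
            feats.append(x)                      # one replica per collected label
            labels.append(label)
            weights.append(1.0 / len(multiset))
    return feats, labels, weights

# The weighted replicas can then be passed to any learner that accepts per-instance
# weights, e.g. scikit-learn's fit(..., sample_weight=weights).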
Get Another Label?
• Single Labeling (SL)
– One label per sample; get more samples
• Repeated labeling (w/ a fixed set of samples)
– Round-robin Repeated Labeling
• Fixed Round Robin (FRR)
– Keep labeling the same set of samples
• Generalized Round Robin (GRR)
– Keep labeling the same set of samples, but give highest preference to the sample with the fewest labels (see the sketch after this list)
– Selective Repeated-Labeling
• Consider label uncertainty of a sample (LU)
• Consider classification (model) uncertainty of a sample (MU)
• Consider both label uncertainty and model uncertainty (LMU)
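As referenced above, a minimal sketch of Generalized Round Robin under my own naming assumptions: get_label stands for a hypothetical call that queries one labeler for one example.

import heapq

def generalized_round_robin(examples, get_label, num_labels_total):
    """Repeatedly request one more label for the example that currently has the fewest."""
    labels = {i: [] for i in range(len(examples))}
    heap = [(0, i) for i in range(len(examples))]   # (current label count, example index)
    heapq.heapify(heap)
    for _ in range(num_labels_total):
        count, i = heapq.heappop(heap)
        labels[i].append(get_label(examples[i]))
        heapq.heappush(heap, (count + 1, i))
    return labels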
Experiment Setting
• 70/30 division (70% for training, 30% for testing)
• Uniform accuracy of labeling as P; for each sample, a correct
label is given with probability P
• Classifier: C4.5 in WEKA
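A rough sketch of this setup under stated assumptions: the paper uses C4.5 in WEKA, so scikit-learn's CART decision tree stands in for it here; labels are assumed binary (0/1) NumPy arrays, and X, y are a placeholder data set.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def run_trial(X, y, p=0.8, seed=0):
    """70/30 split; each training label is kept with probability p, otherwise flipped."""
    rng = np.random.default_rng(seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    flip = rng.random(len(y_tr)) > p
    y_noisy = np.where(flip, 1 - y_tr, y_tr)         # uniform labeling noise
    clf = DecisionTreeClassifier(random_state=seed)  # stand-in for C4.5
    clf.fit(X_tr, y_noisy)
    return clf.score(X_te, y_te)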
Single Labels vs. Majority Voting
• Sample size of training data set matters
• When sample size is large enough, MV is better than SL
• With low noise, getting more (single-labeled) examples is better
[Figure: single labeling (SL) vs. majority voting with fixed round robin (MV-FRR, 50 examples)]
Tradeoffs for Modeling
• Get more labels → improve label quality → improve classification
• Get more examples → improve classification
[Figure: classification accuracy vs. number of training examples (Mushroom data set), with one curve per labeling accuracy P = 1.0, 0.8, 0.6, 0.5]
Selective Repeated-Labeling
• We have seen:
– With enough examples and noisy labels, getting multiple labels is
better than single-labeling
– When we consider costly preprocessing, the benefit is magnified
• Can we do better than the basic strategies?
• Key observation: we have additional information to guide
selection of data for repeated labeling
– Multi-set labels; e.g., {+,-,+,+,-,+} vs. {+,+,+,+}
Natural Candidate: Entropy
• Entropy is a natural measure of label uncertainty:
– E({+,+,+,+,+,+})=0
– E({+,-, +,-, +,- })=1
| S |
| S | | S |
| S |
E (S )  
log 2

log 2
|S|
|S| |S|
|S|
| S  |: positive | S  |: negative
Strategy: Get more labels for examples with high-entropy label
multisets
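A small sketch of the label-multiset entropy above (the function name is my own):

from collections import Counter
from math import log2

def label_entropy(labels):
    """Entropy (in bits) of a multiset of labels, e.g. ['+', '-', '+', '+']."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(label_entropy(list('++++++')))   # 0.0
print(label_entropy(list('+-+-+-')))   # 1.0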
What Not to Do: Use Entropy
[Figure: labeling quality vs. number of labels (waveform, p=0.6) for ENTROPY and GRR; entropy-based selection improves at first but hurts in the long run]
Why not Entropy
• In the presence of noise, entropy will be high
even with many labels
• Entropy is scale invariant: (3+, 2-) has the same entropy as (600+, 400-)
Binomial Dist. with Uniform Prior Dist.
Let $Y \sim \mathrm{Bin}(n, \pi)$ where $\pi \sim \mathrm{Uniform}(0, 1)$. Then
$$p(\pi \mid Y) \propto f_{\mathrm{Unif}(0,1)}(\pi)\, f_{\mathrm{Bin}}(Y \mid \pi) = 1 \cdot \binom{n}{Y} \pi^{Y} (1-\pi)^{n-Y} \propto \pi^{Y} (1-\pi)^{n-Y}$$
(You cannot simply call the posterior a binomial distribution: here we condition on Y, and $\pi$ is the random variable, not the other way around.)
The pdf of the beta distribution, which is known to be proper, is
$$\mathrm{Beta}(x \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}, \qquad 0 < x < 1,\ \alpha, \beta > 0$$
(Note: $\Gamma(k) = (k-1)!$ for integer $k$ [Gamma function].)
Let $x = \pi$, $\alpha = Y + 1$, $\beta = n - Y + 1$. Thus
$$p(\pi \mid Y, n) = \mathrm{Beta}(Y+1,\, n-Y+1) = \frac{\Gamma(n+2)}{\Gamma(Y+1)\,\Gamma(n-Y+1)}\, \pi^{Y}(1-\pi)^{n-Y}$$
The factor $\Gamma(n+2)/(\Gamma(Y+1)\Gamma(n-Y+1))$ is the normalization constant that turns $\pi^{Y}(1-\pi)^{n-Y}$ into a beta density.
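A quick numerical check of this derivation (my own illustration): normalizing $\pi^{Y}(1-\pi)^{n-Y}$ on a grid approximately recovers the Beta(Y+1, n-Y+1) density computed from the Gamma function.

from math import gamma

def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

n, Y = 10, 7                                    # e.g. 7 positive labels out of 10
xs = [i / 100 for i in range(1, 100)]
unnorm = [x**Y * (1 - x)**(n - Y) for x in xs]  # pi^Y (1 - pi)^(n - Y)
norm_const = sum(unnorm) * 0.01                 # crude Riemann-sum normalization
for x, u in zip(xs[24::25], unnorm[24::25]):    # compare at a few grid points
    print(x, round(u / norm_const, 3), round(beta_pdf(x, Y + 1, n - Y + 1), 3))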
Estimating Label Uncertainty (LU)
• Observe +’s and –’s and compute Pr{+|obs} and Pr{-|obs}
• Label uncertainty = tail of the posterior beta distribution beyond the decision threshold p = 0.5 (shape parameters α1, α2 taken from the observed label counts, as in the beta posterior above)
– For more accurate estimation, we can instead use the 95% HDR (Highest Density Region, or interval)
[Figure: Beta(18, 8) posterior density over p; the shaded tail below p = 0.5 is S_LU, and the 95% HDR is approximately [.51, .84]]
Label Uncertainty
• p = 0.7, 5 labels (3+, 2-): Entropy ≈ 0.97, CDFb = 0.34
• p = 0.7, 10 labels (7+, 3-): Entropy ≈ 0.88, CDFb = 0.11
• p = 0.7, 20 labels (14+, 6-): Entropy ≈ 0.88, CDFb = 0.04
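These numbers can be reproduced with a short sketch (my own code, using SciPy's beta distribution): entropy plateaus around 0.88 while the beta tail mass CDFb keeps shrinking as labels accumulate.

from math import log2
from scipy.stats import beta

def entropy(pos, neg):
    n = pos + neg
    return -sum((c / n) * log2(c / n) for c in (pos, neg) if c)

def cdfb(pos, neg):
    # posterior Beta(pos + 1, neg + 1); tail mass on the minority side of 0.5
    tail = beta.cdf(0.5, pos + 1, neg + 1)
    return min(tail, 1 - tail)

for pos, neg in [(3, 2), (7, 3), (14, 6)]:
    print(pos, neg, round(entropy(pos, neg), 2), round(cdfb(pos, neg), 2))
# -> 3 2 0.97 0.34 | 7 3 0.88 0.11 | 14 6 0.88 0.04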
Label Uncertainty vs. Round Robin
[Figure: labeling quality vs. number of labels (waveform, p=0.6) for GRR and LU]
similar results across a dozen data sets
Another strategy:
Model Uncertainty (MU)
• Learning a model of the data provides an
alternative source of information about label
certainty
• Model uncertainty: get more labels for
instances that cannot be modeled well
• Intuition?
– for data quality, low-certainty “regions” may be
due to incorrect labeling of corresponding
instances
– for modeling: why improve training data quality where the model is already certain? (this intuition also motivates LMU, below)
[Figure: a two-dimensional scatter of + and − examples, with '?' marking regions the model cannot fit well]
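A hedged sketch of model-uncertainty scoring, not the authors' exact procedure: estimate each example's class probability with cross-validated models and score it by how close that probability is to 0.5 (illustrative function names; a decision tree stands in for C4.5, and binary labels are assumed).

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def model_uncertainty_scores(X, y_current):
    """y_current: the current (e.g. majority-vote) labels.
    Returns scores in [0, 0.5]; larger means the model is less certain about the example."""
    proba = cross_val_predict(DecisionTreeClassifier(), X, y_current,
                              cv=5, method="predict_proba")[:, 1]
    return 0.5 - np.abs(proba - 0.5)

# Examples with the highest scores are the ones sent back for more labels.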
Yet another strategy:
Label & Model Uncertainty (LMU)
• Label and model uncertainty (LMU): avoid
examples where either strategy is certain
$$S_{LMU} = \sqrt{S_{LU} \cdot S_{MU}}$$
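A one-line sketch combining the two scores (my own code; the inputs are the per-example LU and MU scores from the earlier sketches):

import numpy as np

def lmu_scores(label_uncertainty, model_uncertainty):
    """Geometric mean: near zero whenever either strategy is already certain."""
    return np.sqrt(np.asarray(label_uncertainty) * np.asarray(model_uncertainty))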
Comparison
[Figure: labeling quality vs. number of labels (waveform, p=0.6) for GRR, MU, LU, and LMU; model uncertainty alone also improves quality]
Comparison: Model Quality
• Across 12 domains, LMU is always better than GRR.
• LMU is statistically significantly better than LU and MU
[Figure: model accuracy vs. number of labels (spambase, p=0.6) for GRR, MU, LU, and LMU; the label & model uncertainty (LMU) curve is highlighted]
Summary
• Micro-task outsourcing (e.g., MTurk, Rent-a-Coder, the ESP game) has changed the landscape for data formulation
• Repeated labeling can improve data quality and
model quality (but not always)
• When labels are noisy, repeated labeling can be
preferable to single labeling even when labels
aren’t particularly cheap
• When labels are relatively cheap, repeated
labeling can do much better
• Round-robin repeated labeling can do well
• Selective repeated labeling improves substantially