Employing EM and Pool-Based
Active Learning for Text
Classification
Andrew McCallum
Kamal Nigam
Just Research and Carnegie Mellon University
Text Active Learning
• Many applications
• Scenario: ask for labels of a few documents
• While learning:
– Learner carefully selects unlabeled document
– Trainer provides label
– Learner rebuilds classifier
Query-By-Committee (QBC)
• Label documents with high classification
variance
• Iterate:
– Create a committee of classifiers
– Measure committee disagreement about the
class of unlabeled documents
– Select a document for labeling
• Theoretical results promising
[Freund et al. 97] [Seung et al. 92]
Text Framework
• “Bag of Words” document representation
• Naïve Bayes classification:
P(class | doc) = P(class) · Π_{word ∈ doc} P(word | class) / P(doc)
• For each class, estimate P(word|class)
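The estimate above can be sketched in a few lines; a minimal Python sketch with Laplace smoothing, not the authors' implementation (function names and the toy data are illustrative):

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, vocab):
    """Estimate P(class) and P(word|class) from labeled bag-of-words docs.
    `docs` is a list of token lists; Laplace smoothing avoids zero counts."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    cond = {}
    for c in classes:
        total = sum(word_counts[c].values())
        cond[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab))
                   for w in vocab}
    return prior, cond

def classify(doc, prior, cond):
    """Return the class maximizing log P(class) + sum of log P(word|class)."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c]) + sum(
            math.log(cond[c][w]) for w in doc if w in cond[c])
    return max(scores, key=scores.get)
```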
Outline: Our approach
• Create committee by sampling from
distribution over classifiers
• Measure committee disagreement with KL-divergence of the committee members
• Select documents from a large pool using
both disagreement and density-weighting
• Add EM to use documents not selected for
labeling
Creating Committees
• Each class is a distribution over word frequencies
• For each member, construct each class by:
– Drawing from the Dirichlet distribution
defined by the labeled data
[Diagram: labeled data defines a distribution over classifiers; committee members 1–3 are sampled from it, alongside the single MAP classifier]
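The sampling step above can be sketched with NumPy's Dirichlet sampler; `class_word_counts`, the smoothing constant `alpha`, and the function name are illustrative assumptions, not the authors' code:

```python
import numpy as np

def sample_committee(class_word_counts, k=3, alpha=1.0, seed=0):
    """Sample k committee members. Each member's per-class word
    distribution is drawn from Dirichlet(counts + alpha), the posterior
    implied by the labeled data's word counts for that class.
    `class_word_counts` maps class -> count vector over the vocabulary."""
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(k):
        member = {c: rng.dirichlet(np.asarray(counts, dtype=float) + alpha)
                  for c, counts in class_word_counts.items()}
        committee.append(member)
    return committee
```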
Measuring Committee Disagreement
• Kullback-Leibler Divergence to the mean
– compares differences in how members “vote”
for classes
– Considers entire class distribution of each
member
– Considers “confidence” of the top-ranked class
•  Disagreement = Σ_{k ∈ committee} Σ_{c ∈ classes} P_k(c) · log( P_k(c) / P_avg(c) )
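The score can be computed directly from the committee's per-document class posteriors; a sketch assuming smoothed (strictly positive) posteriors, with the function name chosen for illustration:

```python
import numpy as np

def kl_to_mean_disagreement(member_posteriors):
    """member_posteriors: shape (k, n_classes); each row is one member's
    P_k(class | doc). Returns the sum over members of KL(P_k || P_avg),
    where P_avg is the committee mean. Assumes all entries > 0."""
    P = np.asarray(member_posteriors, dtype=float)
    avg = P.mean(axis=0)
    return float(np.sum(P * np.log(P / avg)))
```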
Selecting Documents
• Stream-based sampling:
– Disagreement => Probability of selection
– Implicit (but crude) instance distribution
information
• Pool-based sampling:
– Select highest disagreement of all documents
– Lose distribution information
Density-weighted pool-based sampling
• A balance of disagreement
and distributional information
• Select documents by:
arg max_{d ∈ unlabeled}  Density(d) · Disagreement(d)
• Calculate Density by:
– (Geometric) Average Distance
to all documents
•  Distance(d_i, d_j) = e^( −β · D[ P(word | d_j) || P̃(word | d_i) ] )
[Diagram: documents arranged by disagreement and density]
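The selection rule can be sketched as below; the function name and the value of β are illustrative, and the word distributions are assumed smoothed (strictly positive) so the KL divergences are finite:

```python
import numpy as np

def select_document(word_dists, disagreements, beta=3.0):
    """Pick arg max_d Density(d) * Disagreement(d).
    word_dists: shape (n, vocab); row i is P(word | d_i), smoothed (> 0).
    Density(d_i) is the geometric mean over all docs j of
    exp(-beta * KL(P(.|d_j) || P(.|d_i))), following the distance above."""
    P = np.asarray(word_dists, dtype=float)
    n = P.shape[0]
    log_density = np.zeros(n)
    for i in range(n):
        # KL(P_j || P_i) for every document j at once
        kl = np.sum(P * np.log(P / P[i]), axis=1)
        # log of the geometric mean of exp(-beta * KL)
        log_density[i] = np.mean(-beta * kl)
    density = np.exp(log_density)
    return int(np.argmax(density * np.asarray(disagreements)))
```

With equal disagreement everywhere, the rule prefers documents in dense regions; a large disagreement can still pull selection toward an outlier.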
Datasets and Protocol
• Reuters-21578 and subset of Newsgroups
• One initial labeled document per class
• 200 iterations of active learning
[Diagram: example class labels — computers, mac, ibm, windows, graphics (Newsgroups); acq, trade, corn, … (Reuters)]
QBC on Reuters
[Three gnuplot plots (EPS previews not recoverable): acq, P(+) = 0.25; trade, P(+) = 0.038; corn, P(+) = 0.018]
Selection comparison on News5
[gnuplot plot (EPS preview not recoverable)]
EM after Active Learning
• After active learning, only a few documents have been labeled
• Use EM to predict the labels of the remaining unlabeled documents
• Use all documents to build a new classification model, which is often more accurate
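The loop can be sketched generically; `train` and `posterior` are caller-supplied hooks (illustrative names, not the authors' API), and a fixed iteration count stands in for an EM convergence test:

```python
def em_with_unlabeled(labeled, unlabeled, train, posterior, iters=5):
    """Self-training EM loop over the unlabeled pool.
    train(labeled, soft=None) -> model: fits from hard-labeled docs plus
        optional (doc, class_distribution) soft-labeled pairs.
    posterior(model, doc) -> class distribution for one document."""
    model = train(labeled)
    for _ in range(iters):
        # E-step: probabilistically label the unlabeled pool
        soft = [(doc, posterior(model, doc)) for doc in unlabeled]
        # M-step: rebuild the classifier from all documents
        model = train(labeled, soft)
    return model
```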
QBC and EM on News5
[gnuplot plot (EPS preview not recoverable)]
Related Work
• Active learning with text:
– [Dagan & Engelson 95]: QBC Part of speech
tagging
– [Lewis & Gale 94]: Pool-based non-QBC
– [Liere & Tadepalli 97 & 98]: QBC Winnow &
Perceptrons
• EM with text:
– [Nigam et al. 98]: EM with unlabeled data
Conclusions & Future Work
• Small P(+) => better active learning
• Leverage unlabeled pool by:
– pool-based sampling
– density-weighting
– Expectation-Maximization
• Different active learning approaches à la [Cohn et al. 96]
• Interleaved EM & active learning