Using Error-Correcting Codes for Efficient Text
Categorization with a Large Number of
Categories
Rayid Ghani
Center for Automated Learning & Discovery
Carnegie Mellon University
Some Recent Work
 Learning from Sequences of fMRI Brain Images (with Tom
Mitchell)
 Learning to automatically build language-specific corpora
from the web (with Rosie Jones & Dunja Mladenic)
 Effect of Smoothing on Naive Bayes for Text Classification
(with Tong Zhang @ IBM Research)
 Hypertext Categorization using links and extracted
information (with Sean Slattery & Yiming Yang)
 Hybrids of EM & Co-Training for semi-supervised learning
(with Kamal Nigam)
 Error-Correcting Output Codes for Text Classification
Text Categorization
Domains:
•Topics
•Genres
•Languages
Numerous $$$-making applications:
•Search Engines/Portals
•Customer Service
•Email Routing ….
Problems
 Practical applications such as web portals deal with a large number of categories
 A lot of labeled examples are needed for training the system
How do people deal with a
large number of classes?
 Use fast multiclass algorithms (Naïve Bayes)
  - Builds one model per class
 Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems
 What happens with a 1000-class problem?
 Can we do better?
ECOC to the Rescue!
 An n-class problem can be solved by solving only ⌈log₂ n⌉ binary problems (e.g., a 105-class problem needs just 7 binary classifiers instead of 105)
 More efficient than one-per-class
 Does it actually perform better?
What is ECOC?
 Solve multiclass problems by decomposing
them into multiple binary problems (Dietterich
& Bakiri 1995)
 Use a learner to learn the binary problems
ECOC - Training and Testing
Training: assign each class a codeword; each column f1..f4 defines one binary problem, learned by the base classifier.

     f1  f2  f3  f4
A     0   0   1   1
B     1   0   1   0
C     0   1   1   1
D     0   1   0   0

Testing: run the four binary classifiers on a new example X to get a bit vector (here 1 1 1 1) and assign the class with the nearest codeword (C, at Hamming distance 1).
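Read as code, the whole scheme is a few lines. A minimal sketch of the train/decode loop, assuming a scikit-learn-style binary base learner (the helper names are illustrative):

```python
import numpy as np

# Code matrix reconstructed from the slide: one row per class,
# one column per binary problem f1..f4.
CODE = np.array([
    [0, 0, 1, 1],   # A
    [1, 0, 1, 0],   # B
    [0, 1, 1, 1],   # C
    [0, 1, 0, 0],   # D
])
CLASSES = ["A", "B", "C", "D"]

def train_ecoc(X, y, make_learner):
    """Train one binary classifier per column of CODE.
    y holds row indices into CODE; make_learner() returns a fresh
    binary classifier with scikit-learn-style fit/predict."""
    learners = []
    for bit in range(CODE.shape[1]):
        clf = make_learner()
        clf.fit(X, CODE[y, bit])      # relabel every example as 0/1
        learners.append(clf)
    return learners

def predict_ecoc(learners, x):
    """Run all binary classifiers, then pick the class whose codeword
    is nearest in Hamming distance to the predicted bit vector."""
    bits = np.array([clf.predict([x])[0] for clf in learners])
    dists = (CODE != bits).sum(axis=1)
    return CLASSES[int(dists.argmin())]
```

On the slide's test example, the predicted bits 1 1 1 1 are nearest to C's codeword 0 1 1 1 (Hamming distance 1), so predict_ecoc would label X as C.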
ECOC - Picture
[Figure, shown as a four-step build-up in the original slides: the code matrix above with classes A-D down the left, columns f1..f4, and the test example X = 1 1 1 1 decoded to the nearest codeword.]
ECOC works but…
 Increased code length → increased accuracy
 Increased code length → increased computational cost

[Figure: Efficiency vs. Classification Performance. Naïve Bayes is efficient but less accurate; ECOC (as used in Berger 99) classifies better at higher cost; the GOAL is both at once.]
Choosing the codewords
 Random? [Berger 1999, James 1999]
  - Asymptotically good (the longer the better)
  - Computational cost is very high
 Use coding theory for good error-correcting codes? [Dietterich & Bakiri 1995]
  - Guaranteed properties for a fixed-length code
Experimental Setup
 Generate the code
  - BCH codes
 Choose a base learner
  - Naive Bayes classifier as used in text classification tasks (McCallum & Nigam 1998)
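BCH codes come with a guaranteed minimum Hamming distance between codewords, and that distance is what buys error correction: a code with minimum distance d still decodes to the right class when up to ⌊(d−1)/2⌋ of the binary classifiers err. A minimal sketch that checks this property for any candidate code matrix (generating the BCH codes themselves is more involved and omitted here):

```python
import numpy as np
from itertools import combinations

def min_pairwise_hamming(code):
    """Smallest Hamming distance between any two rows of a 0/1 code matrix."""
    return min(int((code[i] != code[j]).sum())
               for i, j in combinations(range(len(code)), 2))

def correctable_errors(code):
    """Number of individual binary-classifier mistakes per example that
    nearest-codeword decoding is still guaranteed to recover from."""
    return (min_pairwise_hamming(code) - 1) // 2
```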
Text Classification with Naïve Bayes
 “Bag of Words” document representation
 Estimate parameters of the generative model, with Laplace smoothing:

P(w \mid c) = \frac{1 + \sum_{d \in c} N(w, d)}{|V| + \sum_{d \in c} N(d)}

where N(w, d) is the number of occurrences of word w in document d, N(d) is the total word count of d, and |V| is the vocabulary size.

 Naïve Bayes classification:

P(c \mid d) = \frac{P(c) \prod_{w \in d} P(w \mid c)}{P(d)}
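A minimal sketch of those two estimates (illustrative names; computed in log space to avoid floating-point underflow on long documents):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names.
    Returns log P(class) and Laplace-smoothed log P(word | class)."""
    vocab = {w for d in docs for w in d}
    counts = defaultdict(Counter)                 # class -> word -> count
    n_class = Counter(labels)
    for d, c in zip(docs, labels):
        counts[c].update(d)
    log_prior = {c: math.log(n / len(docs)) for c, n in n_class.items()}
    log_cond = {}
    for c in n_class:
        total = sum(counts[c].values())           # sum of N(d) over docs in class
        log_cond[c] = {w: math.log((1 + counts[c][w]) / (len(vocab) + total))
                       for w in vocab}
    return log_prior, log_cond, vocab

def classify_nb(doc, log_prior, log_cond, vocab):
    """argmax over classes of log P(class) + sum of log P(word | class)."""
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_cond[c][w] for w in doc if w in vocab))
```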
Industry Sector Dataset
[McCallum et al. 1998, Ghani 2000]
 Consists of company web pages classified into 105 economic sectors
Results
Industry Sector Data Set (classification accuracy):

Naïve Bayes   Shrinkage[1]   MaxEnt[2]   MaxEnt w/ Prior[3]   ECOC (63-bit)
66.1%         76%            79%         81.1%                88.5%

ECOC reduces the error of the Naïve Bayes classifier by 66% (from 33.9% error to 11.5%) with no increase in computational cost.

1. (McCallum et al. 1998)
2, 3. (Nigam et al. 1999)
Min HD for correctly vs. wrongly classified examples
[Figure: two histograms of the minimum Hamming distance (Min HD, 0-10) between the predicted bit vector and the nearest codeword, one for correctly classified examples (frequencies up to ~1000) and one for wrongly classified examples (frequencies up to ~100).]
ECOC for better Precision
[Figure: precision-recall curves for ECOC vs. NB; recall from 0.49 to 1.0, precision from 0.6 to 1.0.]
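The Min HD histograms above suggest why this works: correct predictions tend to sit much closer to their nearest codeword than wrong ones, so the minimum Hamming distance can act as a confidence score. A minimal sketch of trading recall for precision by abstaining when even the nearest codeword is far away (the threshold name is illustrative):

```python
import numpy as np

def decode_with_reject(code, class_names, predicted_bits, max_hd=1):
    """Nearest-codeword decoding that abstains when the best match is
    more than max_hd bit flips away. Lowering max_hd answers fewer
    examples (lower recall) but more reliably (higher precision)."""
    dists = (code != np.asarray(predicted_bits)).sum(axis=1)
    best = int(dists.argmin())
    return class_names[best] if dists[best] <= max_hd else None
```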
ECOC for better Precision
[Figure: precision-recall curves for NB vs. 15-bit ECOC; recall from 0 to 1, precision from 0.4 to 0.75.]
New Goal
[Figure: Efficiency vs. Classification Performance again, with NB, ECOC (as used in Berger 99), and the GOAL marked.]
Solutions
 Design codewords that minimize cost and
maximize “performance”
 Investigate the assignment of codewords to
classes
 Learn the decoding function
 Incorporate unlabeled data into ECOC
What happens with sparse data?
[Figure: "Percent Decrease in Error with Training Size and Length of Code": curves for 15-bit, 31-bit, and 63-bit codes over training sizes 0-100; the decrease in error spans roughly 30% to 70%.]
Use unlabeled data with a large
number of classes
 How?
  - Use EM → mixed results
 Think again!
  - Use Co-Training → disastrous results
 Think one more time
How to use unlabeled data?
 Current learning algorithms using unlabeled
data (EM, Co-Training) don’t work well with
a large number of categories
 ECOC works great with a large number of
classes but there is no framework for using
unlabeled data
ECOC + CoTraining = ECoTrain
 ECOC decomposes multiclass problems into
binary problems
 Co-Training works great with binary
problems
 ECOC + Co-Train = Learn each binary
problem in ECOC with Co-Training
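A minimal sketch of the combination, assuming two feature views of each document and a scikit-learn-style base learner; a plain co-training loop (Blum & Mitchell 1998) stands in for whatever variant the experiments used, and all names are illustrative:

```python
import numpy as np

def co_train(X1, X2, lab, y_bits, unlab, make_clf, rounds=10, grow=4):
    """Basic co-training for ONE binary problem: classifiers on the two
    views take turns labeling their most confident unlabeled examples
    for each other, then a final model is fit on view 1."""
    lab, unlab, y = list(lab), list(unlab), dict(y_bits)
    for _ in range(rounds):
        for X in (X1, X2):
            if not unlab:
                break
            clf = make_clf().fit(X[lab], [y[i] for i in lab])
            probs = clf.predict_proba(X[unlab])
            picks = np.argsort(probs.max(axis=1))[-grow:]
            for p in sorted(map(int, picks), reverse=True):  # pop high indices first
                i = unlab.pop(p)
                y[i] = int(clf.classes_[probs[p].argmax()])
                lab.append(i)
    return make_clf().fit(X1[lab], [y[i] for i in lab])

def train_ecotrain(code, X1, X2, lab, class_of, unlab, make_clf):
    """ECoTrain: learn each ECOC bit (code column) with co-training,
    so every binary subproblem gets to exploit the unlabeled pool."""
    return [co_train(X1, X2, lab,
                     {i: code[class_of[i], bit] for i in lab},
                     unlab, make_clf)
            for bit in range(code.shape[1])]
```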
ECOC+CoTrain - Results
Accuracy (%) on the 105-class problem (L = labeled, U = unlabeled examples per class):

                               No Unlabeled Data   Uses Unlabeled Data
Algorithm                      300L + 0U           50L + 250U   5L + 295U
Naïve Bayes                    76                  67           40.3
ECOC 15-bit                    76.5                68.5         49.2
EM                             -                   68.2         51.4
Co-Train                       -                   67.6         50.1
ECoTrain (ECOC + Co-Training)  -                   72.0         56.1
What Next?
 Use an improved version of co-training (gradient descent)
  - Less prone to random fluctuations
  - Uses all unlabeled data at every iteration
 Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training, as sketched below
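Where co-training grows the labeled set a few confident examples at a time, Co-EM has each view relabel the entire unlabeled pool for the other view on every iteration. A minimal sketch (illustrative names; hard labels stand in for the soft posteriors of the full algorithm):

```python
import numpy as np

def co_em(X1, X2, lab, y_lab, unlab, make_clf, iters=10):
    """Co-EM sketch (after Nigam & Ghani 2000): train on one view,
    label ALL unlabeled examples with it, retrain the other view on
    labeled + pseudo-labeled data, and alternate."""
    y = [y_lab[i] for i in lab]
    clf1 = make_clf().fit(X1[lab], y)
    for _ in range(iters):
        u = clf1.predict(X1[unlab])               # view 1 labels the pool
        clf2 = make_clf().fit(np.vstack([X2[lab], X2[unlab]]),
                              np.concatenate([y, u]))
        u = clf2.predict(X2[unlab])               # view 2 labels the pool
        clf1 = make_clf().fit(np.vstack([X1[lab], X1[unlab]]),
                              np.concatenate([y, u]))
    return clf1, clf2
```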
Potential Drawbacks
 Random Codes throw away the real-world
nature of the data by picking random
partitions to create artificial binary problems
Summary
 Use ECOC for efficient text classification
with a large number of categories
 Increase Accuracy & Efficiency
 Use Unlabeled data by combining ECOC and
Co-Training
 Generalize to domain-independent
classification tasks involving a large number
of categories