Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories
Rayid Ghani
Center for Automated Learning & Discovery, Carnegie Mellon University

Some Recent Work
• Learning from sequences of fMRI brain images (with Tom Mitchell)
• Learning to automatically build language-specific corpora from the web (with Rosie Jones & Dunja Mladenic)
• Effect of smoothing on Naive Bayes for text classification (with Tong Zhang @ IBM Research)
• Hypertext categorization using links and extracted information (with Sean Slattery & Yiming Yang)
• Hybrids of EM & Co-Training for semi-supervised learning (with Kamal Nigam)
• Error-Correcting Output Codes for text classification

Text Categorization
Domains: •Topics •Genres •Languages
Numerous ($$$-making) applications: •Search Engines/Portals •Customer Service •Email Routing …

Problems
• Practical applications such as web portals deal with a large number of categories
• A lot of labeled examples are needed for training the system

How do people deal with a large number of classes?
• Use fast multiclass algorithms (Naïve Bayes), which build one model per class
• Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems
• What happens with a 1000-class problem? Can we do better?

ECOC to the Rescue!
• An n-class problem can be solved by solving log2(n) binary problems
• More efficient than one-per-class
• Does it actually perform better?

What is ECOC?
• Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
• Use a learner to learn the binary problems

Training and Testing with ECOC
Each class is assigned a binary codeword; each column fi defines one binary problem:

       f1 f2 f3 f4
  A     0  1  0  0
  B     0  0  1  1
  C     1  1  1  0
  D     1  0  1  0

A test example X is classified by each fi (here X → 1 1 1 1) and assigned to the class whose codeword is closest in Hamming distance.

ECOC — Picture
[Four slide builds over the same codeword matrix: training the binary functions f1–f4 column by column, then decoding the test example X = 1 1 1 1 against the codewords for A–D]

ECOC works but…
• Increased code length = increased accuracy
• Increased code length = increased computational cost
[Plot: efficiency vs. classification performance, locating Naïve Bayes, ECOC (as used in Berger 99), and the GOAL region]

Choosing the codewords
• Random? [Berger 1999, James 1999] — asymptotically good (the longer the better), but the computational cost is very high
• Use coding theory for good error-correcting codes? [Dietterich & Bakiri 1995] — guaranteed properties for a fixed-length code

Experimental Setup
• Generate the code: BCH codes
• Choose a base learner: the Naive Bayes classifier as used in text classification tasks (McCallum & Nigam 1998)

Text Classification with Naïve Bayes
• "Bag of words" document representation
• Estimate parameters of the generative model:

  P(word | class) = (1 + Σ_{doc ∈ class} N(word, doc)) / (|V| + Σ_{doc ∈ class} N(doc))

  where N(word, doc) is the number of occurrences of word in doc, N(doc) is the total number of words in doc, and |V| is the vocabulary size.

• Naïve Bayes classification:

  P(class | doc) = [P(class) / P(doc)] · Π_{word ∈ doc} P(word | class)

Industry Sector Dataset [McCallum et al. 1998, Ghani 2000]
• Consists of company web pages classified into 105 economic sectors

Results — Industry Sector Data Set

  Naïve Bayes   Shrinkage¹   MaxEnt²   MaxEnt w/ Prior³   ECOC 63-bit
  66.1%         76%          79%       81.1%              88.5%

ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost.
1. (McCallum et al. 1998)   2, 3. (Nigam et al. 1999)
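The Naïve Bayes model described above can be written out directly. The following is a minimal sketch, not the implementation used in the talk: the tokenization (documents as token lists) and the toy corpus are illustrative assumptions, and P(doc) is dropped since it is constant across classes.

import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate P(class) and Laplace-smoothed P(word | class) from (doc, label) pairs."""
    vocab = set()
    word_counts = defaultdict(Counter)   # class -> word -> count
    total_words = Counter()              # class -> total number of words
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for word in doc:
            vocab.add(word)
            word_counts[label][word] += 1
            total_words[label] += 1
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    def p_word_given_class(word, c):
        # (1 + sum_{doc in class} N(word, doc)) / (|V| + sum_{doc in class} N(doc))
        return (1 + word_counts[c][word]) / (len(vocab) + total_words[c])
    return priors, p_word_given_class

def classify_nb(doc, priors, p_word_given_class):
    """argmax_class P(class) * prod_{word in doc} P(word | class), in log space."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(math.log(p_word_given_class(w, c)) for w in doc)
    return max(priors, key=log_posterior)

# Illustrative toy corpus (bag-of-words documents as token lists)
docs = [["stocks", "bank", "profit"], ["patient", "drug", "trial"], ["bank", "loan", "rates"]]
labels = ["finance", "pharma", "finance"]
priors, p_w_c = train_nb(docs, labels)
print(classify_nb(["drug", "patient"], priors, p_w_c))   # -> "pharma"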
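The ECOC pipeline itself (codeword matrix, one binary learner per column, minimum-Hamming-distance decoding) can be sketched as follows. This is an illustrative outline rather than the talk's implementation: it draws a random code instead of a BCH code, and it reuses the Naïve Bayes sketch above as the base learner.

import random

def make_code(classes, n_bits, seed=0):
    """Assign each class a random binary codeword of length n_bits (the talk uses BCH codes instead)."""
    rng = random.Random(seed)
    return {c: [rng.randint(0, 1) for _ in range(n_bits)] for c in classes}

def train_ecoc(docs, labels, code):
    """Train one binary Naïve Bayes classifier per codeword column."""
    n_bits = len(next(iter(code.values())))
    binary_models = []
    for bit in range(n_bits):
        # Relabel every document with the bit its class's codeword has in this column
        bit_labels = [code[y][bit] for y in labels]
        binary_models.append(train_nb(docs, bit_labels))
    return binary_models

def classify_ecoc(doc, binary_models, code):
    """Predict each bit, then return the class whose codeword is closest in Hamming distance."""
    predicted = [classify_nb(doc, priors, p_w_c) for priors, p_w_c in binary_models]
    def hamming(c):
        return sum(p != b for p, b in zip(predicted, code[c]))
    return min(code, key=hamming)

# Usage with the toy corpus above
code = make_code({"finance", "pharma"}, n_bits=15)
models = train_ecoc(docs, labels, code)
print(classify_ecoc(["bank", "rates"], models, code))

The minimum Hamming distance computed during decoding is the same quantity the "Min HD" histograms later in the talk are built from.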
Min HD for correctly / wrongly classified examples
[Histograms: frequency vs. minimum Hamming distance (Min HD), shown separately for correctly and wrongly classified examples]

ECOC for better Precision
[Precision–recall curves comparing 15-bit ECOC with Naïve Bayes]

New Goal
[Plot: efficiency vs. classification performance, locating NB, ECOC (as used in Berger 99), and the GOAL region]

Solutions
• Design codewords that minimize cost and maximize "performance"
• Investigate the assignment of codewords to classes
• Learn the decoding function
• Incorporate unlabeled data into ECOC

What happens with sparse data?
[Plot: percent decrease in error vs. training size, for 15-bit, 31-bit, and 63-bit codes]

Use unlabeled data with a large number of classes — how?
• Use EM → mixed results → think again!
• Use Co-Training → disastrous results → think one more time

How to use unlabeled data?
• Current learning algorithms that use unlabeled data (EM, Co-Training) don't work well with a large number of categories
• ECOC works great with a large number of classes, but there is no framework for using unlabeled data

ECOC + Co-Training = ECoTrain
• ECOC decomposes multiclass problems into binary problems
• Co-Training works great with binary problems
• ECOC + Co-Train = learn each binary problem in ECOC with Co-Training

ECOC + CoTrain — Results (105-class problem; accuracy in %)

  Algorithm                        Uses unlabeled data?   300L + 0U   50L + 250U per class   5L + 295U per class
  Naïve Bayes                      No                     76          67                     40.3
  ECOC 15-bit                      No                     76.5        68.5                   49.2
  EM                               Yes                    —           68.2                   51.4
  Co-Train                         Yes                    —           67.6                   50.1
  ECoTrain (ECOC + Co-Training)    Yes                    —           72.0                   56.1

  (L = labeled examples, U = unlabeled examples)

What Next?
• Use an improved version of Co-Training (gradient descent): less prone to random fluctuations; uses all unlabeled data at every iteration
• Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training

Potential Drawbacks
• Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems

Summary
• Use ECOC for efficient text classification with a large number of categories
• Increase accuracy & efficiency
• Use unlabeled data by combining ECOC and Co-Training
• Generalize to domain-independent classification tasks involving a large number of categories
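To make the ECoTrain combination above concrete, one round of Co-Training for a single ECOC bit might look roughly like the sketch below. It is a simplified illustration under stated assumptions, building on the Naïve Bayes and ECOC sketches earlier: each document is assumed to come as two feature views (e.g., two halves of the text), "confidence" is taken as the Naïve Bayes posterior gap, and the growth rate per round is arbitrary; none of these details are specified in the talk.

import math

def cotrain_bit(labeled, unlabeled, rounds=5, grow=2):
    """Co-training for one binary ECOC problem.
    labeled:   list of ((view1_tokens, view2_tokens), bit) pairs
    unlabeled: list of (view1_tokens, view2_tokens) pairs
    Returns the two per-view Naïve Bayes models trained on the augmented labeled set."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        models = []
        for view in (0, 1):
            docs = [views[view] for views, bit in labeled]
            bits = [bit for views, bit in labeled]
            models.append(train_nb(docs, bits))   # reuses the Naïve Bayes sketch above
        if not unlabeled:
            break
        # Each view labels the unlabeled examples it is most confident about
        for view, (priors, p_w_c) in enumerate(models):
            def confidence(views):
                scores = sorted(
                    (math.log(priors[c]) + sum(math.log(p_w_c(w, c)) for w in views[view]) for c in priors),
                    reverse=True)
                return scores[0] - (scores[1] if len(scores) > 1 else scores[0])
            unlabeled.sort(key=confidence, reverse=True)
            for views in unlabeled[:grow]:
                labeled.append((views, classify_nb(views[view], priors, p_w_c)))
            unlabeled = unlabeled[grow:]
    return models

One such pair of per-view models would be trained for every column of the ECOC matrix; at test time the two views' predictions for each bit are combined (e.g., by multiplying their posteriors) and the resulting bit vector is decoded to the nearest codeword, as in the plain ECOC sketch.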